2  Multiple Regression

Mainly based on Wooldridge (2019), Chapters 3, 9 and Greene (2017)

2.1 Topics of this chapter

  • Motivation for a multiple linear regression model, MLR model
  • Interpretation of the MLR model
  • Computation of the coefficients of the MLR model
  • Frisch-Waugh-Lovell (FWL) Theorem
  • Algebraic properties of OLS estimates
  • Standard assumptions for the MLR model
  • Unbiasedness of OLS
  • Variance of OLS
  • Efficiency of OLS - Gauss-Markov Theorem
  • Misspecification
    • Overspecification
    • Underspecification - omitted variable bias
    • Errors in the variables and proxy variables
    • Random coefficient model
    • Functional forms
  • OLS in matrix notation

2.2 Definition of the multiple linear regression model

\text{ }

Explains variable y in terms of variables x_1, x_2, \ldots , x_k

y \ = \ \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_k x_k + u \tag{2.1}

y, … dependent variable
x_j, … independent variables, explanatory variables, regressors
\beta_0, … intercept (coefficient of x_0, which is a vector of ones)
\beta_j, \ j=1, \ldots, k, … slope parameters
u, … error term, disturbance – represents unobservable factors and influences


2.3 Motivation for multiple regression

  • Incorporate more explanatory factors into the model. This is important for testing economic theories and better predictions

  • Explicitly holding fixed other factors that otherwise would end up in u and probably cause a correlation between u and the remaining explanatory variable. If this is actually the case, a causal interpretation of the model is not possible (z_j \rightarrow x \text{ and } z_j \rightarrow y case)

    • Avoid omitted variable bias

    • Can mimic experimental data with random assignment to some extent

  • Allow for more flexible functional (non-linear) forms


2.3.1 Simulation study

To give a more in-depth motivation for multiple regression models, we simulate a simple market model with demand and supply functions.

  • We want to estimate the market demand function for a certain good, i.e., the quantity demanded as a function of the price;

Q_D=f(P) \tag{2.2}

  • The true demand function is supposed to be

Q_D = a_1 P + b_1 Y + u, \quad \ a_1<0, \ \ b_1>0 \tag{2.3}

  • So, demand Q_D negatively depends on prices P (mainly substitution effects) and positively on income Y. u represents an error term of unobserved factors which likewise affect demand

  • We additionally have the supply function, that positively depends on prices P and negatively on wages W. v represents an error term of unobserved factors affecting supply Q_S

Q_S = a_2 P + b_2 W + v, \quad \ a_2>0, \ \ b_2<0 \tag{2.4}

  • Because we are interested only in the price effect, we decide to apply a simple regression model with only one explanatory variable – prices. Is the approach promising? As we will see, the answer is no!

Digression: Reduced forms and identification

Assume that the equilibrium condition Q_D=Q_S=Q is always met and rewrite our demand and supply functions in the following way, so that the two endogenous variables Q and P are on the left hand side and the two variables Y and W along with the two error terms u and v remain on the right side. The index i simply indicates that the demand and supply relationships hold for each observation i. 1

Q_i - a_1 P_i \ = \ b_1 Y_i + u_i \tag{2.5}

Q_i - a_2 P_i \ = \ b_2 W_i + v_i \tag{2.6}

Thereby, we assume that the variables Y, W are exogenous – determined outside the model – and have nothing to do with the two error terms; i.e., there exist no other equations which explain Y, W in terms of the errors u, v. This basically implies that E([Y_i, W_i] \mid u, v)= [0, 0] so that the conditional mean independence assumption is met.

Equation 2.5 and Equation 2.6 constitute the so called structural model with the structural parameters a_1, a_2, b_1, b_2, which we are interested in and want to estimate.

This market model can be easily written in matrix notation:

\underbrace{\left[\begin{array}{cc} 1 & -a_1 \\ 1 & -a_2 \\ \end{array}\right]}_{\mathbf A} \times \underbrace{\left[\begin{array}{c} Q_i \\ P_i \end{array}\right]}_{\mathbf y_i} \ = \ \underbrace {\left[\begin{array}{cc} b_1 & 0 \\ 0 & b_2 \\ \end{array}\right]}_{\mathbf B} \times \underbrace {\left[\begin{array}{c} Y_i \\ W_i \end{array}\right]}_{\mathbf x_i} \ + \ \underbrace {\left[\begin{array}{c} u_i \\ v_i \\ \end{array}\right]}_{\mathbf e_i} \tag{2.7}

Or shorter:

\mathbf A \, \mathbf y_i \ = \ \mathbf B \, \mathbf x_i \ + \ \mathbf e_i

This matrix equation can be solved by multiplying the equation from the left with the inverse of matrix \mathbf A. (Remember that \mathbf A^{-1}\mathbf A = \mathbf I.)

\mathbf y_i \ = \ \mathbf A^{-1} \mathbf B \, \mathbf x_i \ + \ \mathbf A^{-1} \mathbf e_i \tag{2.8}

Or more compact:

\mathbf y_i \ = \ \mathbf \Pi \, \mathbf x_i \ + \ \mathbf A^{-1} \mathbf e_i \tag{2.9}

Equation 2.9 is the so-called Reduced Form of the model. This shows how the endogenous variables in \mathbf y_i (in our case [Q_i, P_i]') solely depend on the exogenous variables in \mathbf x_i (in our case [Y_i, W_i]') and on the exogenous error term \mathbf e_i ([u_i, v_i]' in our case).

The reduced form represents the so-called data generating process because the model is driven by the values of the exogenous variables and thereby generates the values of the endogenous variables.

  • Thus, the reduced form of the model is used to simulate the data for our little simulation study in the text. Thereby, the exogenous variables and error terms Y_i, W_i, u_i, v_i are “replaced” with random draws, and Equation 2.9 (respectively Equation 2.10) is used to calculate the corresponding values for Q_i, P_i

Important note: The reduced form parameter matrix \mathbf \Pi can be consistently estimated by OLS as all right hand side variables in Equation 2.9 are exogenous. Hence, the conditional mean independence assumption is always met for a (“true”) reduced form by assumption.

Identification

If the model is just identified, the structural parameters of Equation 2.7 can be recovered from the estimated reduced form parameter matrix \mathbf \Pi.

  • In particular, for our market model, the reduced form is:

\left[\begin{array}{c} Q_i \\ P_i \end{array}\right]\ = \ \left[\begin{array}{cc} 1 & -a_1 \\ 1 & -a_2 \\ \end{array}\right]^{-1} \left[\begin{array}{cc} b_1 & 0 \\ 0 & b_2 \\ \end{array}\right] \times \left[\begin{array}{c} Y_i \\ W_i \end{array}\right] \ + \ \left[\begin{array}{cc} 1 & -a_1 \\ 1 & -a_2 \\ \end{array}\right]^{-1}\left[\begin{array}{c} u_i \\ v_i \\ \end{array}\right] \tag{2.10}

  • The inverse of \mathbf A is:

\left[\begin{array}{cc} 1 & -a_1 \\ 1 & -a_2 \\ \end{array}\right]^{-1} = \ \frac {1}{a_1-a_2} \left[\begin{array}{cc} -a_2 & a_1 \\ -1 & 1 \\ \end{array}\right]

  • This leads to the reduced form parameter matrix \mathbf \Pi = \mathbf A^{-1} \mathbf B:

\mathbf \Pi \ = \ \frac {1}{a_1-a_2} \left[\begin{array}{cc} -a_2 & a_1 \\ -1 & 1 \\ \end{array}\right] \times \left[\begin{array}{cc} b_1 & 0 \\ 0 & b_2 \\ \end{array}\right] \ = \ \frac {1}{a_1-a_2} \left[\begin{array}{cc} -a_2 b_1 & a_1b_2 \\ -b_1 & b_2 \\ \end{array}\right] \tag{2.11}

  • To recover the structural parameters a_1, a_2, b_1, b_2, we utilize the fact that the formula above implies a relationship for each of the four elements of the estimated \hat {\mathbf \Pi}:

\hat \pi_{11} = -\dfrac {1}{a_1-a_2} a_2b_1 \quad \quad \hat \pi_{12} = \dfrac {1}{a_1-a_2} a_1b_2

\hat \pi_{21} = -\dfrac {1}{a_1-a_2} b_1 \quad \quad \hat \pi_{22} = \dfrac {1}{a_1-a_2} b_2

  • These are four equations in the four structural parameters a_1, a_2, b_1, b_2. The solution for this non-linear system of equations is:

a_1 = \frac{\hat \pi_{12}}{\hat \pi_{22}} \quad \quad a_2 = \frac{\hat \pi_{11}}{\hat \pi_{21}}

b_1 = \hat \pi_{11} - \frac{\hat \pi_{12}\hat \pi_{21}}{\hat \pi_{22}} \quad \quad b_2 = \hat \pi_{12} - \frac{\hat \pi_{11}\hat \pi_{22}}{\hat \pi_{21}}

Note that we had imposed two zero-restrictions on the matrix \mathbf B. In particular, these restrictions mean that variable W_i is not part of the demand function and Y_i is not part of the supply function.

If these zero-restrictions are not imposed, we would have to estimate two additional structural parameters, which raises the number of structural parameters to six. But six parameters cannot be recovered from four equations in a unique manner. Hence, the structural form would not be uniquely determined by the reduced form.

This means that there exist other observationally equivalent structural forms which have the same reduced form – the same data generating process, which generates the very same observations of the endogenous variables Q and P. Therefore, the structural model is not identified (estimable) in this case.

To demonstrate this very important issue, suppose we multiply the original structural model Equation 2.7 with an invertible, but otherwise arbitrary transformation matrix T, \ \text { with } \ t_1 \cdot t_2 \neq 1:

\underbrace{\left[\begin{array}{cc} 1 & t_1 \\ t_2 & 1 \\ \end{array}\right]}_{\mathbf T} \ \underbrace{\left[\begin{array}{cc} 1 & -a_1 \\ 1 & -a_2 \\ \end{array}\right]}_{\mathbf A} \ \underbrace{\left[\begin{array}{c} Q_i \\ P_i \end{array}\right]}_{\mathbf y_i} \ = \ \underbrace{\left[\begin{array}{cc} 1 & t_1 \\ t_2 & 1 \\ \end{array}\right]}_{\mathbf T} \ \underbrace {\left[\begin{array}{cc} b_1 & 0 \\ 0 & b_2 \\ \end{array}\right]}_{\mathbf B} \ \underbrace {\left[\begin{array}{c} Y_i \\ W_i \end{array}\right]}_{\mathbf x_i} \ + \ \underbrace{\left[\begin{array}{cc} 1 & t_1 \\ t_2 & 1 \\ \end{array}\right]}_{\mathbf T} \ \underbrace {\left[\begin{array}{c} u_i \\ v_i \\ \end{array}\right]}_{\mathbf e_i}

Or more compact:

\mathbf A^* \, \mathbf y_i \ = \ \mathbf B^* \, \mathbf x_i \ + \ \mathbf e^*_i \tag{2.12}

with:

\mathbf A^* \ := \ \mathbf T \mathbf A \ = \ \left[\begin{array}{cc} (1+t_1) & (-a_1-t_1a_2) \\ (1+t_2) & (-t_2a_1-a_2) \\ \end{array}\right]

\mathbf B^* \ := \ \mathbf T \, \mathbf B \ = \ \left[\begin{array}{cc} b_1 & t_1b_2 \\ t_2b_1 & b_2 \\ \end{array}\right] \tag{2.13}

\mathbf e^*_i \ := \ \mathbf T \, \mathbf e_i \ = \ \left[\begin{array}{c} u_i+t_1v_i \\ v_i+t_2u_i \\ \end{array}\right]

Equation 2.12 is a new and different structural model, with different demand and supply functions.2

The important point is that the two different model structures have the same reduced form and thus exhibit the same data generating process, yielding the very same observations – they are observationally equivalent.

To prove this claim we calculate the reduced form of the transformed model Equation 2.12:

\mathbf y_i \ = \ \mathbf A^{*-1} \mathbf B^* \mathbf x_i \ + \ \mathbf A^{*-1} \mathbf e^*_i \ =

(\mathbf T \mathbf A)^{-1} \mathbf T \, \mathbf B \, \mathbf x_i \ + \ (\mathbf T \mathbf A)^{-1} \mathbf T \, \mathbf e_i \ = \tag{2.14}

\mathbf A^{-1} \mathbf T^{-1} \mathbf T \, \mathbf B \, \mathbf x_i \ + \ \mathbf A^{-1} \mathbf T^{-1} \mathbf T \, \mathbf e_i \ =

\mathbf A^{-1} \mathbf B \, \mathbf x_i \ + \ \mathbf A^{-1} \mathbf e_i

This is identical to Equation 2.8, which was the reduced form of the original structural model. This proves the observational equivalence of the original and the alternative (transformed) structure. (To derive this result we used rule Equation B.4 and the fact that \mathbf T^{-1} \mathbf T = \mathbf I.)

Note that the two zero-restrictions in the \mathbf B matrix disappeared in the \mathbf B^* matrix of the new structure, see Equation 2.13. This means that the applied transformation is not allowed if we strongly believe that the two zero-restrictions are true and a feature of the true structure. Hence, these are identifying restrictions, ruling out such a transformation of the model.

Corollary: Our analysis showed that we always need some identifying restrictions to be able to estimate a structural equation, as otherwise it is always possible to find alternative models which are observationally equivalent.

Remark: It is clearly possible that some equations of the structural model are identified and some not. For instance, if we impose the zero-restriction for the demand function and not for the supply function, only the demand function would be identified (estimable).


Simulating the market model

  • First, we compute simulated data according to the specified market model above with random numbers for the exogenous variables and error terms Y, W, u, v

    • We define a_1=-1, b_1=1, a_2=1, b_2=-1, sd(Y)=3 and sd(W)=2, sd(u)=0.1 and sd(v)=1

    • Remark: Thereby, we employed an (incredible) identifying restriction to be able to estimate the demand function with OLS. The used restriction is that the variance of the error term u is very small. This assumption prevents a noticeable feedback mechanism which would lead to a violation of the conditional mean independence assumption in the demand equation 3

Code
# Simulating data for the market model

# Demand : Q = a1*P + b1*Y + u
# Supply : Q = a2*P + b2*W + v

# y = (Q, P)', x = (Y, W)', e = (u, v)'

# In matrix language: A*y = B*x + e   =>   
# The solution of the model is:
# y = A^-1*B*x + A^-1*e
# which is the so called reduced form, or data generating process

# Structural parameters
a1=-1
a2=1
b1=1
b2=-1

my = 20 # mean of Y
mw = 5  # mean of W

# Model matrices
A = rbind( c(1, -a1), c(1, -a2) )
B = rbind( c(b1, 0), c(0, b2) )

# Many firms over many periods (pooled cross section)
n=1500
u = c( rnorm(n, 0, 0.1) )  # very small variance of u 
v = c( rnorm(n, 0, 1) )   
Y = c( rnorm(n, my, 3) ) 
W = c( rnorm(n, mw, 2) ) 

x = rbind(Y, W)
e = rbind(u, v)

# Model solution 
y = solve(A) %*% B %*% x + solve(A) %*% diag(2) %*% e

# Simulated Q and P
Q = y[1, ]
P = y[2, ]

# Factor variable for Y
Y_ = cut( Y, c(0, 13,14,15,16,17,18,19,20,21,22,23,24,25,26,27, 50) )
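
Tying back to the identification digression: with the simulated data, the structural parameters can also be recovered by indirect least squares, i.e., by estimating the reduced form Equation 2.10 with OLS and applying the recovery formulas derived above. A minimal sketch (object names are illustrative):

# Indirect least squares: OLS estimation of the reduced form (no intercept,
# as in the structural model) and recovery of a1, a2, b1, b2
rf_Q <- lm(Q ~ 0 + Y + W)   # first reduced form equation: Q on Y and W
rf_P <- lm(P ~ 0 + Y + W)   # second reduced form equation: P on Y and W

Pi_hat <- rbind(coef(rf_Q), coef(rf_P))   # estimated reduced form matrix

a1_hat <- Pi_hat[1, 2] / Pi_hat[2, 2]
a2_hat <- Pi_hat[1, 1] / Pi_hat[2, 1]
b1_hat <- Pi_hat[1, 1] - Pi_hat[1, 2] * Pi_hat[2, 1] / Pi_hat[2, 2]
b2_hat <- Pi_hat[1, 2] - Pi_hat[1, 1] * Pi_hat[2, 2] / Pi_hat[2, 1]

round(c(a1 = a1_hat, a2 = a2_hat, b1 = b1_hat, b2 = b2_hat), 3)  # close to (-1, 1, 1, -1)
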

Estimating the true model of demand

  • Afterwards, we employ a multiple regression model to estimate the demand equation (Equation 2.3), which yields nearly perfect parameter estimates for the true values a_1=-1 and b_1=1 (because we have 1500 simulated observations).
library(AER) # to use `coeftest()`
# True model of demand, with true price effect equal to -1 
# and true income effect equal to 1.  
# The multiple regression model yields nearly perfect parameter estimates.
coeftest( lm(Q ~ P + Y) ) 
       
       t test of coefficients:
       
                     Estimate Std. Error  t value Pr(>|t|)    
       (Intercept) -0.0162019  0.0184326   -0.879   0.3796    
       P           -0.9936583  0.0023127 -429.654   <2e-16 ***
       Y            0.9967699  0.0014447  689.926   <2e-16 ***
       ---
       Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Estimating the simple (misspecified) model of demand

  • Now, we employ the simple regression model, Q = c_0 + c_1 P + u
# Simplified model of demand, ignoring income Y 
# We are finding a wrong positive (!) price effect 
DS <- lm(Q ~ P)
coeftest( DS ) 
       
       t test of coefficients:
       
                   Estimate Std. Error t value  Pr(>|t|)    
       (Intercept) 3.924686   0.312891  12.543 < 2.2e-16 ***
       P           0.282967   0.024769  11.424 < 2.2e-16 ***
       ---
       Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • This yields a strongly biased price effect! Why?

    • The simple regression model ignores income Y, which ends up in the error term u

      • But income is positively correlated with prices P: Higher income \rightarrow more demand and therefore higher prices, leading to a violation of the conditional mean independence assumption in the demand function: the error term now contains Y, so E(u \mid P) \neq 0
    • Hence, we are finding a wrong positive (!) price effect. The true partial effect is -1

  • Actually, we estimate a sort of mixture of the demand and supply function

  • What is the reason for this result? The following Figure 2.1 should shed some light on the problem


Plotting the simulated data

  • Panel a) of Figure 2.1 shows a standard market model with a demand function (in red) and a supply function (in blue). The observed data of Q and P are the result of the intersections of the demand and supply functions – market equilibrium points. These functions vary because of changing values of income Y and wages W and the error terms u and v

    • Note that the realized data points mimic neither the supply nor the demand function
  • Panel b) of Figure 2.1 shows the regression line of a simple regression model of Q on P, which we just estimated above. The slope value of 0.2829674 in no way represents the parameter a_1 of the true demand function Equation 2.3


Code
library(ggplot2)

# FWL-theorem

# demand equation
out_qy <- lm(Q ~ Y)
res_qy <- residuals(out_qy)

out_py <- lm(P ~ Y)
res_py <- residuals(out_py)

outQP_y <- lm(res_qy ~ res_py)


# supply equation
out_qw <- lm(Q ~ W)
res_qw <- residuals(out_qw)

out_pw <- lm(P ~ W)
res_pw <- residuals(out_pw)

outQP_r <- lm(res_qw ~ res_pw)


# Plotting

ggplot( mapping=aes(x=P, y=Q))  +
  geom_point() +
  geom_abline(intercept=b1*my-(-4:4), slope=a1, color="#cd0000", linewidth=0.4, linetype = "dashed") +
  geom_abline(intercept=b1*my, slope=a1, color="#cd0000", linewidth=1) +
  geom_abline(intercept=b2*mw-(-4:4), slope=a2, color="#0066cc", linewidth=0.4, linetype = "dashed") +
  geom_abline(intercept=b2*mw, slope=a2, color="#0066cc", linewidth=1) +
  theme_bw()

ggplot( mapping=aes (x=P, y=Q))  +
  geom_point( ) +
  geom_smooth(method=lm, se=F, linewidth=1, color="#aa55aa") +
  theme_bw()

ggplot( mapping=aes (x=P, y=Q) )  +
  geom_point( aes(color=Y_ ) ) +
  geom_smooth( aes(color=Y_ ), method=lm, se=F, linewidth=1) +
  geom_abline(intercept=b1*my, slope=a1, color="#cd0000", linewidth=1) +
  annotate("text", x = 12.3, y = 13, label = "Demand functions Q(P, Y) for given Y", size=6, color="#4488AA") +
  theme_bw()

ggplot( mapping=aes (x=res_py, y=res_qy) )  +
  geom_point( aes(color=Y_ ) ) +
  geom_smooth( method=lm, se=F, linewidth=1,   color="#dd0000") +
  theme_bw()

(a) Market equilibrium points – intersection points of shifting supply (blue) and demand (red) function (because of varying Y, W, u, v). These are the points (Q, P) we observe

(b) Regression line lm(Q ~ P) through the data points. But this is not the demand function we look for – we get a spurious positive effect of P on Q

(c) Regression lines lm(Q ~ P | Y) for several given income levels Y, which are colored differently. This shows the true partial effect of P on demand Q for a given income. The true demand function for the mean income is in red

(d) True demand relationship: Multiple regression model, holding income constant with the help of the FWL-theorem, see Section 2.5

Figure 2.1: Model of supply and demand. Every observation results from an intersection of the supply and demand function (market equilibrium). Supply and demand functions vary because of variations of income, wages and unobservable shocks in supply and demand (error terms)


  • In panel c) we are plotting the same data as in b) but highlighting different levels of income with different colors. We further draw separate regression lines for the different income levels

    • This reveals the true negative (!) demand-relationship between quantity demanded and prices, given certain levels of income
  • Panel d) shows the demand relationship for constant income, which mimics the true price effect on demand. We accomplished this by purging the quantity and price data of the effects of income – controlling for income. This is an application of the very important FWL-Theorem discussed later in Section 2.5

  • To demonstrate the relationship between quantities and prices shown in panel c), we estimate a simple regression model, but only for a very narrow range of income. This procedure leads to a correct estimation of a_1 in a simple regression framework, as shown in panel c) for several different given income ranges

# We regress Q on P, but only for the subset that Y is from the interval [19.8, 20.2]
coeftest( lm(Q ~ P, subset = (Y>19.8 & Y<20.2) ) )
       
       t test of coefficients:
       
                    Estimate Std. Error t value  Pr(>|t|)    
       (Intercept) 19.816581   0.181806 108.999 < 2.2e-16 ***
       P           -0.985611   0.014562 -67.682 < 2.2e-16 ***
       ---
       Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This fact demonstrates that the estimated coefficients of a multiple regression model represent the effect of one explanatory variable on the dependent variable, holding the other explanatory variables constant.


2.3.2 Empirical example: Wage equation

Measuring the effect of education on wages, but this time explicitly holding experience fixed

The model: \log(wage) \ = \ \beta_0 + \beta_1 educ + \beta_2 exper + u
\text{ }

library(AER)  # AER is a general econometric package 
library(wooldridge)  # the wooldridge package includes all data sets used in the textbook  
data(wage1)  # loading data set wage1 of the wooldridge  package  
# regressing log(wage) on educ, storing results in list model1

model1 <- lm(lwage ~ educ, data = wage1)
# regressing log(wage) on educ and exper, storing results in list model2

model2 <- lm(lwage ~ educ + exper, data = wage1)

Showing estimation results
# show estimated parameters for:
model1
      
      Call:
      lm(formula = lwage ~ educ, data = wage1)
      
      Coefficients:
      (Intercept)         educ  
          0.58377      0.08274
model2
      
      Call:
      lm(formula = lwage ~ educ + exper, data = wage1)
      
      Coefficients:
      (Intercept)         educ        exper  
          0.21685      0.09794      0.01035

Interpretation of the multiple regression model

  • The multiple linear regression model manages to hold the values of other explanatory variables fixed even if, in reality, they are correlated with the explanatory variable under consideration

  • Hence, we have a “ceteris paribus”-interpretation

  • It still has to be assumed that the unobserved factors in u do not change as a result of changing explanatory variables (conditional mean independence assumption)

  • In our particular wage equation example, the effect of one more year of education is 0.083 on log(wage), or (times 100) 8.3% on wage, if other effects are not controlled for. But in this case, some of the effect could stem from more experience, which is possibly negatively correlated with education

  • If we additionally include a variable for experience, exper, the effect of education on wage increases to 9.8%. That means:
    If we compare two persons with the same level of experience, but person A has one year more education than person B, we predict person A to have a wage that is 9.8% higher than that of person B

  • Hence, in some sense, a multiple regression model simulates a controlled experiment by holding constant other relevant factors

  • The error term contains all the other variables which we do not control for (maybe because we have no data; e.g., ability in our wage equation which probably has an important impact on wages)


Interpretation of coefficients of the multiple regression model

  • If the dependent variable is in logs, we have a semi-elasticity

\beta_j = \dfrac {\partial \log(y)}{\partial x_j} = \dfrac {\partial y/y}{\partial x_j}

  • This measures by how many percent y changes (for instance 0.05) if x_j changes by one unit (we have to multiply \beta_j by 100 to get the percentage value, 5%); a small numerical check follows after this list

  • If both the dependent and the explanatory variable are in logs, we arrive at an elasticity, which is measured in percent

\beta_j = \dfrac {\partial \log(y)}{\partial \log(x_j)} = \dfrac {\partial y/y}{\partial x_j/x_j}

  • If only the explanatory variable is in logs, we have

\beta_j = \dfrac {\partial y}{\partial \log(x_j)} = \dfrac {\partial y}{\partial x_j/x_j}

  • This measures by how many units the dependent variable changes if \log(x_j) increases by one unit. So, you have to divide \beta_j by 100 to get the effect of a 1% increase in x_j on y
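
Returning to the first case (dependent variable in logs): since 100 \cdot \beta_j is only an approximation for small coefficients, the exact percentage effect is 100\,(e^{\beta_j}-1). A small check, using the educ coefficient of model2 from above:

# Approximate vs. exact percentage effect of one more year of education on wage
b_educ <- unname(coef(model2)["educ"])
c(approx = 100 * b_educ, exact = 100 * (exp(b_educ) - 1))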

Non-linearities

# plot of effects of exper on lwage
plot(wage1$lwage ~ wage1$exper)


  • There seems to be a nonlinear effect of experience on wages

  • So, we try the model specification:

\log(wage) \ = \ \beta_0 + \beta_1 educ + \beta_2 exper + \beta_3 exper^2 + u

# modeling a non-linear effect of exper
model3 <- lm(lwage ~ educ + exper + I(exper^2), data = wage1)
model3
      
      Call:
      lm(formula = lwage ~ educ + exper + I(exper^2), data = wage1)
      
      Coefficients:
      (Intercept)         educ        exper   I(exper^2)  
        0.1279975    0.0903658    0.0410089   -0.0007136

In this particular case, the partial effect of exper is

\dfrac{\partial lwage}{\partial exper} = \beta_2 + 2\beta_3exper = 0.041 - 0.00143 exper

  • Hence, this partial effect is not constant but is a function of exper. Often, these partial effects are calculated

    • at the average of the explanatory variables - Partial Effect at Average, PEA

    • or as the effect averaged over all observations, APE

  • Partial effects are sometimes inconvenient to calculate, especially with products of variables – interaction terms. But there are packages in R, for instance marginaleffects, which calculate the average partial effects and present them in tables or graphically (a manual cross-check follows after the output below)

# Calculating the APE for the model above with "marginaleffects"
library(marginaleffects)
avg_slopes(model3)
      
        Term Estimate Std. Error     z Pr(>|z|)     S  2.5 % 97.5 %
       educ    0.0904    0.00747 12.10   <0.001 109.6 0.0757 0.1050
       exper   0.0167    0.00182  9.17   <0.001  64.1 0.0131 0.0203
      
      Type:  response 
      Comparison: dY/dX
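
The APE for exper reported above can be reproduced by hand from the coefficients of model3, averaging the derivative \beta_2 + 2\beta_3 \, exper over all observations; a small cross-check:

# Average partial effect of exper computed manually from model3
b <- coef(model3)
mean(b["exper"] + 2 * b["I(exper^2)"] * wage1$exper)   # ~ 0.0167, as reported by avg_slopes()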

# with "plot_slopes" from the package "marginaleffects" we can plot the partial effect from above
# depending on "exper"
plot_slopes(model3, variable = "exper", condition = "exper") + theme_bw()


# with "plot_predictions" from the package "marginaleffects" we can plot the 
# predicted values of lwage depending on "exper"
plot_predictions(model3, condition="exper") + theme_bw()


2.4 Computing the coefficients of the multiple regression model

  • As in the simple regression model we assume a random sample of y_i and x_{i,j}

  • We define the regression residuals: \hat u_i := \ y_i - \hat \beta_0 - \hat \beta_1 x_{i,1} - \hat \beta_2 x_{i,2} - \ldots - \hat \beta_k x_{i,k} \tag{2.15}

  • We choose \hat \beta_0, \hat \beta_1, \hat \beta_2, \ldots \hat \beta_k so that the sum of squared residuals is minimized

\underset{\hat{\beta}}{\min } \sum_{i=1}^{n} \hat u_i^2 \ \rightarrow \ \hat \beta_0, \hat \beta_1, \hat \beta_2, \ldots \hat \beta_k

  • This leads to the following formula for \hat \beta_1

\hat{\beta}_{1} \, = \, \frac{ \sum_{i=1}^{n} \hat{r}_{i,1} \ y_{i}}{\sum_{i=1}^{n} \hat{r}_{i,1}^{2}} \tag{2.16}

  • If we compare this formula with the one for the single regression model, we notice that (x_i - \bar x) is now replaced with \hat r_{i,1}

  • Thereby, \hat r_{1} are the residuals of a regression of x_{1} on all the other explanatory variables of the model, x_0, x_{2}, \ldots , x_{k}

    • This means, only the part of x_{1} that is uncorrelated with all other x_{2}, \ldots , x_{k} is relevant for estimation of \beta_1
  • The formula in Equation 2.16 shows that the multiple regression model from Equation 2.1 is equivalent to a transformed, single regression model

y \ = \ \beta_0 + \beta_1 \hat r_{1} + v \tag{2.17}

  • This way, we get the partial effect of x_1 on y, holding constant (or controlling for the effects of) all other x-es

  • This fact is the result of an important general theorem, the Frisch-Waugh-Lovell (FWL) Theorem


2.4.1 Matrix notation of the multiple regression model

The model

y_1 \ = \ \beta_0 + \beta_1 x_{1,1} + \beta_2 x_{1,2} + \ldots + \beta_k x_{1,k} + u_1 \\ \vdots \\ y_n \ = \ \beta_0 + \beta_1 x_{n,1} + \beta_2 x_{n,2} + \ldots + \beta_k x_{n,k} + u_n

is stacked to

\mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \mathbf{u} \tag{2.18}

with

\mathbf{y} = \underbrace{ \left[\begin{array}{c} y_{1} \\ y_{2} \\ \vdots \\ y_{n} \end{array}\right]}_{n \ \times \ 1}, \ \ \mathbf{X} = \underbrace{ \left[\begin{array}{ccccc} 1 & x_{1,1} & x_{1,2} & \cdots & x_{1, k} \\ 1 & x_{2,1} & x_{2,2} & \cdots & x_{2, k} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n, 1} & x_{n, 2} & \cdots & x_{n, k} \end{array}\right]}_{n \ \times \ (k+1) }, \ \ \boldsymbol{\beta} = \underbrace{ \left[\begin{array}{c} \beta_{0} \\ \vdots \\ \beta_{k} \end{array}\right]}_{(k+1) \ \times \ 1}, \ \ \mathbf{u} = \underbrace{ \left[\begin{array}{c} u_{1} \\ u_{2} \\ \vdots \\ u_{n} \end{array}\right]}_{n \ \times \ 1}


Principle of ordinary least squares:

For calculating the residuals insert an estimate of \boldsymbol \beta in Equation 2.18

\mathbf{y} = \mathbf{X} \boldsymbol{\hat \beta} + \mathbf{\hat u}

The residuals are then defined as:

\mathbf{\hat u} \ = \ \mathbf y - \mathbf{X} \boldsymbol{\hat \beta} \ = \ (\mathbf y - \mathbf{\hat y})

The sum of the squared residuals is:

SSR := \ \sum_{i=1}^{n}\left(\hat{u}_{i}\right)^{2} \ =\ \hat{\mathbf{u}}^{\prime} \hat{\mathbf{u}} \ = \ (\mathbf{y}-\mathbf{X} \hat{\boldsymbol{\beta}})^{\prime}(\mathbf{y}-\mathbf{X} \hat{\boldsymbol{\beta}}) \ =

\mathbf{y}^{\prime} \mathbf{y}-2 \hat{\boldsymbol{\beta}}^{\prime} \mathbf{X}^{\prime} \mathbf{y}+\hat{\boldsymbol{\beta}}^{\prime} \mathbf{X}^{\prime} \mathbf{X} \hat{\boldsymbol{\beta}} \\

as \hat{\boldsymbol{\beta}}' \mathbf{X}' \mathbf{y} is a scalar and hence equal to (\hat{\boldsymbol{\beta}}' \mathbf{X}' \mathbf{y})' = \mathbf{y}' \mathbf{X} \hat{\boldsymbol{\beta}}.

For minimization of the sum of squared residuals set the first derivative with respect to \boldsymbol{\hat \beta} to zero:

\underset{\hat{\beta}}{\min }(S S R): \ \frac{\partial S S R}{\partial \hat{\boldsymbol{\beta}}} \ = \ \frac{\partial\left(\mathbf{y}^{\prime} \mathbf{y}-2 \hat{\boldsymbol{\beta}}' \mathbf{X}^{\prime} \mathbf{y}+\hat{\boldsymbol{\beta}}' \mathbf{X}^{\prime} \mathbf{X} \hat{\boldsymbol{\beta}}\right)}{\partial \hat{\boldsymbol{\beta}}} \ = 0 \ \ \Rightarrow

-2 \mathbf{X}' {\mathbf{y}}+2 \mathbf{X}^{\prime} \mathbf{X} \hat{\boldsymbol{\beta}}=0


Solution for the parameter vector

This gives the so called normal equations:

\mathbf{X}^{\prime} \mathbf{X} \hat{\boldsymbol{\beta}} = \mathbf{X}' \mathbf{y} \\

The solution for the parameter vector \hat{\boldsymbol{\beta}} therefore is:

\hat{\boldsymbol{\beta}}=(\mathbf{X}' \mathbf{X})^{-1} \mathbf{X}' \mathbf{y} \tag{2.19}

Thereby,

\mathbf{X}^{\prime} \mathbf{X}=\left[\begin{array}{cccc} 1 & 1 & \cdots & 1 \\ x_{1,1} & x_{2,1} & \cdots & x_{n, 1} \\ \vdots & \vdots & \ddots & \vdots \\ x_{1, k} & x_{2, k} & \cdots & x_{n, k} \end{array}\right] \cdot\left[\begin{array}{cccc} 1 & x_{1,1} & \cdots & x_{1, k} \\ 1 & x_{2,1} & \cdots & x_{2, k} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n, 1} & \cdots & x_{n, k} \end{array}\right]=

=\left[\begin{array}{cccc} n & \sum_{i} x_{i, 1} & \cdots & \sum_{i} x_{i, k} \\ \sum_{i} x_{i, 1} & \sum_{i} x_{i, 1}^{2} & \cdots & \sum_{i} x_{i, 1} x_{i, k} \\ \sum_{i} x_{i, 2} & \sum_{i} x_{i, 2} x_{i, 1} & \ddots & \sum_{i} x_{i, 2} x_{i, k} \\ \vdots & \vdots & & \vdots \\ \sum_{i} x_{i, k} & \sum_{i} x_{i, k} x_{i, 1} & \cdots & \sum_{i} x_{i, k}^{2} \end{array}\right] \ = \ \sum_{i=1}^n \mathbf x_i' \mathbf x_i

and

\mathbf{X}^{\prime} \mathbf{u}=\left[\begin{array}{cccc} 1 & 1 & \cdots & 1 \\ x_{1,1} & x_{2,1} & \cdots & x_{n, 1} \\ \vdots & \vdots & \ddots & \vdots \\ x_{1, k} & x_{2, k} & \cdots & x_{n, k} \end{array}\right] \cdot\left[\begin{array}{c} u_{1} \\ u_{2} \\ \vdots \\ u_{n} \end{array}\right]=\left[\begin{array}{c} \sum_{i} u_{i} \\ \sum_{i} x_{i, 1} u_{i} \\ \vdots \\ \sum_{i} x_{i, k} u_{i} \end{array}\right] \ = \ \sum_{i=1}^n \mathbf x_i' u_i \\

with \mathbf x_i being the i^{th} row of the data matrix \mathbf X \, (so, this is a 1 \times (k+1) row vector!). \, \mathbf x_i' \mathbf x_i therefore is an outer product and hence a (k+1) \times (k+1) matrix and \,\mathbf x_i' u_i is a (k+1) \times 1 vector.


Estimator for the variance of the error term: \hat \sigma^2:

\hat \sigma^2 \ = \ \dfrac{1}{n-k-1} \sum_{i=1}^n \hat u_i^2 \ = \ \dfrac{\mathbf {\hat u}' \mathbf {\hat u}} {n-k-1} \tag{2.20}
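
Equation 2.19 and Equation 2.20 are easy to verify numerically. A minimal sketch with the wage1 data (the chosen regressors are just for illustration); the results should match lm(lwage ~ educ + exper, data = wage1):

# OLS "by hand": beta_hat = (X'X)^(-1) X'y  and  sigma2_hat = u'u / (n-k-1)
library(wooldridge)
data(wage1)

y <- wage1$lwage
X <- cbind(1, wage1$educ, wage1$exper)   # intercept, educ, exper
n <- nrow(X); k <- ncol(X) - 1

beta_hat   <- solve(t(X) %*% X) %*% t(X) %*% y               # Equation 2.19
u_hat      <- y - X %*% beta_hat                             # residuals
sigma2_hat <- as.numeric(t(u_hat) %*% u_hat) / (n - k - 1)   # Equation 2.20

beta_hat            # compare with coef(lm(lwage ~ educ + exper, data = wage1))
sqrt(sigma2_hat)    # residual standard error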


We have the population model for every observation, with \mathbf x_i being a row vector with the (k+1) explanatory variables:

y_i = \mathbf{x}_i \boldsymbol{\beta} + u_i

Additionally, we have assumption 4 (MLR.4): E (u_i \mid \mathbf X) = 0, which implies:

E(\mathbf x_i' u_i) \ = \ E_x [E(\mathbf x_i' u_i \mid \mathbf X)] \ = \ E_x [\mathbf x_i' \, E( u_i \mid \mathbf X)] \ = \ E_x [\mathbf x_i' \cdot 0] \ = \ \mathbf 0

This, furthermore implies:

E(\mathbf x_i' u_i) \ = \ E \left(\mathbf x_i' ( y_i - \mathbf x_i \boldsymbol \beta) \right) \ = \

E(\mathbf x_i' y_i) - E(\mathbf x_i' \mathbf x_i) \, \boldsymbol \beta \ = \ \mathbf 0

\Rightarrow \ \ \boldsymbol \beta \ = \ E(\mathbf x_i' \mathbf x_i)^{-1} E(\mathbf x_i' y_i)

If we have a sample of n observations we can implement these (k+1) moment restrictions by replacing the population (theoretical) moments with their sample counterparts to estimate the (k+1) parameters

\boldsymbol{\hat \beta} \ = \ \left( \frac{1}{n} \sum_{i=1}^n \mathbf x_i' \mathbf x_i \right )^{-1} \left( \frac{1}{n} \sum_{i=1}^n \mathbf x'_i y_i \right ) \ = \ \left( \sum_{i=1}^n \mathbf x_i' \mathbf x_i \right )^{-1} \left( \sum_{i=1}^n \mathbf x'_i y_i \right ) \ = \ (\mathbf{X}' \mathbf{X})^{-1} \mathbf{X}' \mathbf{y}

In this case, the Method of Moments (MoM) estimator is the same as the OLS estimator. This is due to our standard assumption MLR.4.


2.5 The Frisch-Waugh-Lovell (FWL) Theorem

  • Equation 2.17 above shows that alternatively, one can estimate the coefficient of any explanatory variable of a multiple regression within a simple regression model by a two-step procedure:

    • 1 Regress the explanatory variable – in our case x_1 – on all other explanatory variables including the intercept

    • 2 Regress y on the residuals from this regression (as in Equation 2.17)

    • Important remark: If we additionally regress y on the other explanatory variables from above and use the residuals of this regression to replace y in Equation 2.17, we get the same estimate of \hat \beta_1.
      The advantage of this variant is that in this case the residuals of Equation 2.17 are identical to those of Equation 2.1

  • This is the famous and very important FWL Theorem

  • Why does this procedure work?

    • The residuals from the first regression are the part of the explanatory variable x_1 that is uncorrelated with the other explanatory variables

    • The slope coefficient of the second regression (representing Equation 2.17) therefore yields the isolated effect of the explanatory variable on the dependent variable, because x_1 is purged of any indirect effects of the other variables


FWL Theorem - an example

We regress log(wage) on educ, exper and expersq. First, we estimate the usual multiple regression model

# loading a library which allows printing the results of several regressions in one table
library(modelsummary)

# We use the wage1 data from the Wooldridge package
data(wage1)
# original model, we are interested in the coef of educ

out <- lm(lwage ~ educ + exper + expersq, data = wage1)
out
      
      Call:
      lm(formula = lwage ~ educ + exper + expersq, data = wage1)
      
      Coefficients:
      (Intercept)         educ        exper      expersq  
        0.1279975    0.0903658    0.0410089   -0.0007136

Applying the FWL theorem

Now, we want to estimate the coefficient of educ via a single regression, applying the FWL theorem. So, educ is x_1 and the other explanatory variables are the intercept, exper, expersq.

  • Hence, in the first step we regress educ on [intercept, exper, expersq]
# regressing educ on exper + expersq

out3 <- lm(educ ~ exper + expersq, data = wage1)
educ_res = out3$residuals
  • Additionally, we regress the dependent variable lwage on [intercept, exper, expersq] and store the residuals in lwage_res. This second step is not necessary, but useful if we want to have the residuals of the original model and carry out t-tests
# regressing lwage on exper + expersq 

out2 <- lm(lwage ~ exper + expersq, data = wage1)
lwage_res = out2$residuals
  • In the third step, we regress the residuals lwage_res on the residuals educ_res 4. The coefficient of educ_res in this regression model should equal the coefficient of educ in the original regression model.
# regressing lwage_res on educ_res 

out4 <- lm(lwage_res ~ 0 + educ_res)
out4
      
      Call:
      lm(formula = lwage_res ~ 0 + educ_res)
      
      Coefficients:
      educ_res  
       0.09037
  • As we see, the coefficients are identical. The same is true for standard errors, t-values and residual variance

    • Actually, they differ a tiny bit, because the two-step procedure does not take into account that we lost degrees of freedom in the first step. But asymptotically, this is irrelevant.

Code
# Using 'modelsummary' for a pretty output

modelsummary(list("Multiple Regression"=out, "Simple Regression"=out4),
            stars = TRUE, 
            fmt = 4,
            statistic = c('statistic'),
            gof_omit = "A|L|B|F",
            align = "ldd",
            output = "flextable"
            )
Table 2.1: Applying the FWL Theorem (t-statistics in brackets)

                        Multiple Regression    Simple Regression
       (Intercept)           0.1280
                            (1.2083)
       educ                  0.0904***
                           (12.1004)
       exper                 0.0410***
                            (7.8916)
       expersq              -0.0007***
                           (-6.1639)
       educ_res                                     0.0904***
                                                  (12.1351)
       Num.Obs.            526                    526
       R2                    0.300                  0.219
       RMSE                  0.44                   0.44

+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

2.5.1 FWL Theorem in matrix notation

Consider a regression model with two groups of regressors. We are especially interested in the effects of \, \mathbf X_1 on \mathbf y, i.e. on \boldsymbol{\hat\beta}_1

\mathbf{y} = \mathbf{X}_1 \boldsymbol{\hat\beta}_1 + \mathbf{X}_2 \boldsymbol{\hat\beta}_2 + \mathbf{\hat u} \tag{2.21}

  • Define the so called orthogonal projection matrix or residual maker matrix \, \mathbf M_2 as

\mathbf M_2 := \ ( \mathbf I - \underbrace{\mathbf X_2(\mathbf X_2' \mathbf X_2)^{-1}\mathbf X_2'}_{:= \ \mathbf P_2}) = (\mathbf I - \mathbf P_2) \tag{2.22}

  • Define the so called projection matrix or hat matrix \, \mathbf P_2 as

\mathbf P_2 := \ \mathbf X_2(\mathbf X_2' \mathbf X_2)^{-1}\mathbf X_2' \tag{2.23}

  • Multiplying \mathbf P_2 with an appropriately dimensioned arbitrary vector \mathbf z yields a vector of predicted values \mathbf{\hat z} of a regression of \, \mathbf z on \mathbf X_2:

\mathbf P_2 \mathbf z \ = \ \mathbf X_2 \underbrace{ (\mathbf X_2' \mathbf X_2)^{-1}\mathbf X_2' \, \mathbf z }_{\boldsymbol {\hat \delta}} \ = \ \mathbf X_2 \boldsymbol {\hat \delta} \ = \ \mathbf{\hat z}

  • Thereby \boldsymbol{\hat \delta} is the estimated regression parameter vector of the regression: \, \mathbf z = \mathbf X_2 \boldsymbol{\hat \delta} + \mathbf{\hat e}

  • Multiplying \mathbf M_2 with an appropriately dimensioned arbitrary vector \mathbf z yields the residuals \mathbf {\hat e} of a regression of \, \mathbf z on \mathbf X_2:

\mathbf M_2 \mathbf z \, = \, (\mathbf I - \mathbf P_2) \mathbf z \, = \, \mathbf z - \mathbf P_2 \mathbf z \, = \, \mathbf z - \mathbf X_2 \boldsymbol{\hat\delta} \, = \, \mathbf z - \mathbf{\hat z} \, = \, \boldsymbol{\hat e}


  • Both projection matrices are of enormous importance in econometrics.

    • Both matrices are symmetric and idempotent which means: \, \mathbf P_2 \mathbf P_2 = \mathbf P_2 and \, \mathbf M_2 \mathbf M_2 = \mathbf M_2

    • Furthermore \, \mathbf P_2, \mathbf M_2 are orthogonal: \, \mathbf P_2 \mathbf M_2 = \mathbf M_2 \mathbf P_2 = \mathbf 0
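
Both properties can be checked numerically. A small sketch, using a hypothetical \mathbf X_2 built from the wage1 data loaded earlier (intercept, exper, expersq):

# Numerical check of symmetry, idempotency and orthogonality of P2 and M2
X2 <- cbind(1, wage1$exper, wage1$expersq)
P2 <- X2 %*% solve(t(X2) %*% X2) %*% t(X2)   # projection ("hat") matrix
M2 <- diag(nrow(X2)) - P2                    # residual maker matrix

max(abs(P2 - t(P2)))       # ~ 0: symmetric
max(abs(P2 %*% P2 - P2))   # ~ 0: idempotent
max(abs(M2 %*% M2 - M2))   # ~ 0: idempotent
max(abs(P2 %*% M2))        # ~ 0: orthogonal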


Now, we pre-multiply the original multiple regression model, Equation 2.21, with \, \mathbf M_2:

\mathbf M_2 \mathbf{y} = \mathbf M_2 \mathbf{X}_1 \boldsymbol{\hat\beta}_1 + \underbrace{\mathbf M_2 \mathbf{X}_2 \boldsymbol{\hat\beta}_2}_{\mathbf 0} + \underbrace{\mathbf M_2\mathbf{\hat u}}_{\mathbf{\hat u}}

  • To see that \, \mathbf M_2 \mathbf{X}_2 \boldsymbol{\hat\beta}_2= \mathbf 0 we can simply multiply out. But from a content point of view, it is clear that regressing \mathbf X_2 on \mathbf X_2 yields zero-residuals

  • Furthermore we have \, \mathbf M_2 \mathbf{\hat u} = (\mathbf I - \mathbf P_2) \mathbf{\hat u} = \mathbf{\hat u}

    • \mathbf P_2\mathbf{\hat u} = \mathbf 0 \, because OLS residuals are always uncorrelated with regressors – \mathbf X_2'\mathbf{\hat u} = \mathbf 0 (orthogonality property of OLS). So, regressing OLS residuals on some of their regressors leads to a regression with zero fit and therefore unchanged residuals

Hence, after pre-multiplying the original regression model with \, \mathbf M_2 we have:

\mathbf M_2 \mathbf{y} = \mathbf M_2 \mathbf{X}_1 \boldsymbol{\hat\beta}_1 + \mathbf{\hat u}

  • On the left hand side we have the residuals of a regression of \, \mathbf y on \mathbf X_2: \, \mathbf M_2 \mathbf{y}

  • On the right hand side we have residuals of a regression of \, \mathbf X_1 on \mathbf X_2: \, \mathbf M_2 \mathbf X_1

Thus, regressing the left hand residuals only on the right hand residuals yields the very same estimate \boldsymbol{\hat \beta}_1 and the very same residuals \mathbf{\hat u} as the original multiple regression model. This proves the FWL Theorem.

Important Remark: If \, \mathbf X_2 only consists of the unit vector \mathbf i = (1,1, \ldots, 1)', we have: \,\mathbf M_{\mathbf i } = (\mathbf I - \frac {1}{n} \mathbf i \, \mathbf i') and \mathbf P_{\mathbf i} \mathbf X_1 = \mathbf{\bar X_1} and therefore \, \mathbf M_{\mathbf i} \mathbf X_1 = (\mathbf{X_1} - \mathbf{\bar X_1}). Hence, intercept \rightarrow demeaning
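
A tiny sketch of this last remark, illustrating that multiplying by \mathbf M_{\mathbf i} demeans a variable (here educ from the wage1 data):

# Projection on the constant: M_i x equals the demeaned x
n    <- nrow(wage1)
ones <- rep(1, n)
M_i  <- diag(n) - ones %*% t(ones) / n                           # M_i = I - (1/n) i i'
max(abs(M_i %*% wage1$educ - (wage1$educ - mean(wage1$educ))))   # ~ 0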


2.6 Algebraic properties of OLS estimates

  • Sum of residuals is zero
    \sum_{i=1}^n \hat u_i = 0
    Follows from the normal equations (Equation 1.8)

  • Orthogonality of x_j and \hat u_i - zero covariance
    \sum_{i=1}^n x_{ij} \hat u_i = 0
    Follows from the normal equations (Equation 1.9)

  • Sample averages of y and of the regressors lie on regression line (regression hyperplane)
    \bar{y}=\widehat{\beta}_{0}+\widehat{\beta}_{1} \bar{x}_{1}+\ldots+\widehat{\beta}_{k} \bar{x}_{k}
    Follows from Equation 1.10

  • Orthogonality allows a decomposition of total variation of y (see Section 1.3.6)
    SST = SSE + SSR \ \ \rightarrow

  • Goodness-of-Fit measure: R2
    R^2 := SSE/SST = 1 - SSR/SST

  • Regressing a model, as usual, with an intercept yields the very same results as when regressing all variables demeaned (including y), but without an intercept (follows from the FWL theorem, see proof above, last remark). Hence, the role of the intercept is demeaning all the variables

  • Alternative expression for R-squared: In an OLS model with an intercept, R2 is equal to the squared correlation between the actual values y and the predicted values \hat y

R^{2} := \ \dfrac{\left(\sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)\left(\hat{y}_{i}-\bar{\hat{y}}\right)\right)^{2}}{\left(\sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)^{2}\right) \left( \sum_{i=1}^{n} ( \hat{y}_{i}-\bar{\hat{y}} )^{2} \right)} \tag{2.24}

  • R2 never decreases when an additional regressor is added

  • This suggests using an adjusted measure of fit that accounts for the degrees of freedom, df, of the numerator and denominator (a small computational check follows after this list):
    Adjusted R2 or R bar squared

\bar{R}^{2} := \ 1-\dfrac{SSR/(n-k-1)}{SST/(n-1)}=1-\dfrac{n-1}{n-k-1}\left(1-R^{2}\right)

  • This measure can decrease when additional variables do not contribute to explaining y in a relevant way (it increases only if the added variable's t-statistic is greater than one in absolute value)
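
A small computational check of R^2 and the adjusted R^2, using model2 (lwage on educ and exper) from the wage example:

# R-squared and adjusted R-squared by hand for model2
u_hat <- residuals(model2)
y     <- wage1$lwage
SSR <- sum(u_hat^2)
SST <- sum((y - mean(y))^2)
n <- length(y); k <- 2

c(R2     = 1 - SSR/SST,
  R2_adj = 1 - (SSR/(n - k - 1)) / (SST/(n - 1)))
# compare with summary(model2)$r.squared and summary(model2)$adj.r.squared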

Orthogonality property of OLS

Proposition: The explanatory variables are orthogonal to the OLS residuals:

\mathbf X^{\prime} \hat{\mathbf{u}} \ = \ \mathbf 0

Proof:

\hat{\mathbf{u}} \ = \ (\mathbf{y}-\mathbf{X} \hat{\boldsymbol{\beta}}) \ = \ (\mathbf{y}-\mathbf{X} \underbrace{\left(\mathbf{X}^{\prime} \mathbf{X}\right)^{-1} \, \mathbf{X}^{\prime} \mathbf{y}}_{\hat{\boldsymbol{\beta}}}) \tag{2.25}

Multiplying Equation 2.25 by \ \mathbf X'

\Rightarrow \ \ \mathbf{X}' \hat{\mathbf{u}} \ = \ (\mathbf{X}^{\prime} \mathbf{y}-\underbrace{\mathbf{X}^{\prime} \mathbf{X}\left(\mathbf{X}^{\prime} \mathbf{X}\right)^{-1}}_{\mathbf I} \, \mathbf{X}^{\prime} \mathbf{y}) \ = \ (\mathbf{X}^{\prime} \mathbf{y}-\mathbf{X}^{\prime} \mathbf{y}) \ = \ \mathbf{0}

Corollary:

\mathbf i' \mathbf {\hat u} \ = \ \sum_{i=1}^{n} \hat{u}_{i} \ =\ 0

as the first row of \mathbf X' is the transposed ones-vector \mathbf i' = [1, 1, \ldots, 1].
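
This orthogonality property is easy to check numerically for any fitted model; a one-line sketch with model2 from above:

# X'u_hat should be zero up to floating point error
max(abs(t(model.matrix(model2)) %*% residuals(model2)))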


2.7 Assumptions of the multiple regression model

Here, we are using the notation from Wooldridge (2019)

  • Assumption 1 (MLR.1): The model is linear in parameters

  • Assumption 2 (MLR.2): Observations (y_i, x_{i,j}) are the result of random sampling of size n from the population model. (Typically, this assumption does not hold for time series, but it can be relaxed, provided the dependencies between successive observations are not too strong and decline fast enough)

  • Assumption 3 (MLR.3): No perfect collinearity

    • Remarks on MLR.3: This assumption only rules out perfect collinearity/correlation between explanatory variables. Constant variables are also ruled out (collinear with intercept)

    • An example for perfect collinearity; two candidates and the number of votes depend on the shares of campaign expenditures:

    vote \, A = \beta_0 + \beta_1 share A + \beta_2 share B + u

    • The shares of candidates A and B sum to one and are therefore perfectly collinear with the intercept

    • Imperfect correlation is allowed, but leads to the problem of multicollinearity which regularly causes imprecise estimates and will be discussed later

  • Assumption 4 (MLR.4): Exogeneity of explanatory variables – zero conditional mean or conditional mean independence assumption

    E(u_i \mid x_{1}, x_{2} \ldots , x_{k}) = E(u_i) = 0

    • The values of the explanatory variables must not contain any information about the mean of the unobserved factors. In this case \frac {\partial E(u \mid x)}{\partial x}=0

    • Discussion of the zero conditional mean assumption

      • In a multiple regression model, the zero conditional mean assumption is much more likely to hold because fewer things end up in the error term

      • Explanatory variables that are correlated with the error term are called endogenous; endogeneity is a violation of assumption MLR.4

      • Explanatory variables that are uncorrelated with the error term are called exogenous; MLR.4 holds if all explanatory variables are exogenous

      • Exogeneity is the key assumption for a causal interpretation of the partial effects measured by \beta and for unbiasedness of the OLS estimates.
        (Remark: MLR.4 also generally leads to consistent OLS estimates, meaning that with an infinite sample size the parameters can be estimated “infinitely” precise)

    • Assumption 4’ (MLR.4’): Weak Exogeneity of explanatory variables

      E(u_i \mid x_{i,1}, x_{i,2} \ldots , x_{i,k}) = E(u_i) = 0

      • This is a weaker variant of MLR.4 and only this is generally necessary for consistent estimates if the error term is iid. This variant merely requires that u_i and x_{i,k} are uncorrelated within the same observation, whereas MLR.4 requires that a particular u_i is uncorrelated with x_{k}, which means with every observation of x_{k}
  • Assumption 5 (MLR.5): Homoscedasticity of u_i

    Var(u_i \mid x_{1}, x_{2} \ldots , x_{k}) = \sigma^2

    • The value of the explanatory variables must not contain any information about the variance of the unobserved factors

    • A violation of this assumption leads to inefficient estimates and incorrect formulas for statistical tests (both problems can be handled within OLS via appropriately transformed data or heteroscedasticity consistent estimates of the covariance matrix)

  • Assumption 6 (MLR.6): Conditional normality of u_i

    u_i \mid (x_{1}, x_{2} \ldots , x_{k}) \ \sim \ N(0, \sigma^2)

    • This assumption is needed for statistical tests in small samples

Assumption 4 (MLR.4): Exogeneity of \, \mathbf X \, (zero conditional mean assumption)

E({\mathbf{u}} \ | \ \mathbf{X}) = 0 \tag{2.26}

  • This is implied by E(u_i \ | \ \mathbf{x}_i) = 0 and MLR.2 (random sampling)

  • Assumption 4’ (MLR.4’)

    E(u_i \ | \ \mathbf{x}_i) = 0 \tag{2.27}

    (without random sampling), is weaker than Equation 2.26 and is called weak exogeneity – it requires only non-correlation between \mathbf{x}_i and u_i within a particular observation i and is necessary for consistency, whereas the stronger MLR.4 is necessary for unbiasedness


Assumption 5 (MLR.5): Homoscedasticity of \mathbf u. This assumption, together with the assumption of no autocorrelation in the errors (random sampling), can be stated in a very compact form by assuming a particular covariance matrix

\operatorname {Var} (\mathbf u \ | \ \mathbf X) = \sigma^2 \mathbf I \tag{2.28}

  • This is implied by \operatorname {Var} (u_i \mid \mathbf{x}_i) = \sigma^2 and MLR.2 (random sampling)

Assumption 5 (MLR.5) can be stated with the help of the concept of a covariance matrix

  • The covariance matrix of \mathbf u is defined by:

\operatorname {Var} (\mathbf u \ | \ \mathbf X) \ := \ E\left[ \left( \mathbf{u}-E(\mathbf{u} \ | \ \mathbf{X}) \right) \left( \mathbf{u}-E(\mathbf{u} \ | \ \mathbf{X}) \right)^{\prime} \right] \ (\text{ by def. }) \ =

E\left[(\mathbf{u}-\mathbf{0}) \cdot(\mathbf{u}-\mathbf{0})^{\prime} \ | \ \mathbf{X} \right] \ (\text { by assumpt. } E(\mathbf{u} \ | \ \mathbf{X})=\mathbf{0}) \ =

E\left[\mathbf{u} \mathbf{u}' \ | \ \mathbf X \right] \ = \ E \left( \left[\begin{array}{cccc} u_{1}^{2} & u_{1} u_{2} & \cdots & u_{1} u_{n} \\ u_{2} u_{1} & u_{2}^{2} & \cdots & u_{2} u_{n} \\ \vdots & \vdots & \ddots & \vdots \\ u_{n} u_{1} & u_{n} u_{2} & \cdots & u_{n}^{2} \end{array}\right] \ \Bigg| \ \mathbf X \right) \ =

\left[ \begin{array}{cccc} Var\left(u_{1}\right) & {Cov}\left(u_{1}, u_{2}\right) & \cdots & {Cov}\left(u_{1}, u_{n}\right) \\ {Cov}\left(u_{2}, u_{1}\right) & Var\left(u_{2}\right) & \cdots & {Cov}\left(u_{2}, u_{n}\right) \\ \vdots & \vdots & \ddots & \vdots \\ {Cov}\left(u_{n}, u_{1}\right) & {Cov}\left(u_{n}, u_{2}\right) & \cdots & Var\left(u_{n}\right) \end{array} \right]_{ \ \mid\ \mathbf X}

  • Imposing the assumption of random sampling (or no autocorrelation in the error term) implies that \operatorname {Cov} \left( u_i,u_j \mid \mathbf X \right) = 0, \ \forall i \neq j

  • Imposing homoscedasticity implies that \operatorname {Var} \left( u_i \, | \, \mathbf X \right) = \sigma^2, \ \forall i

  • Hence, we have:

\left[\begin{array}{cccc} \sigma^{2} & 0 & \cdots & 0 \\ 0 & \sigma^{2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma^{2} \end{array}\right] \ = \ \sigma^2 \, \mathbf I

\overset {MLR.5} {\Rightarrow} \ \ \operatorname {Var} (\mathbf u \ | \ \mathbf X) \ = \ \sigma^2 \ \mathbf I


2.8 Statistical properties of OLS estimates

2.8.1 Unbiasedness of OLS

Theorem 2.1 (Unbiasedness of OLS) \text{ }

Under MLR.1 to MLR.4 OLS parameter estimates are unbiased:

E(\hat \beta_j) \ = \ \beta_j, \ \ \ \forall j
Further, under MLR.1 to MLR.5:

E(\hat \sigma^2) \ = \ E \left[ \frac{1}{n-k-1} \sum_{i=1}^n \hat u_i^2 \right] \ = \ \sigma^2

The first step is a very important alternative representation for \, {\hat{\boldsymbol{\beta}}}, considering the “true” model: \, \mathbf y = \mathbf X \boldsymbol \beta + \mathbf u

\hat{\boldsymbol{\beta}} \ = \ (\mathbf{X}' \mathbf{X})^{-1} \mathbf{X}' \mathbf{y} \ = \ (\mathbf{X}' \mathbf{X})^{-1} \mathbf{X}^{\prime}(\underbrace{\mathbf{X} \boldsymbol{\beta}+\mathbf{u}}_{\mathbf y}) \ =

=\ \underbrace{(\mathbf{X}' \mathbf{X})^{-1} \mathbf{X}^{\prime} \mathbf{X} }_{\mathbf I} \ \boldsymbol{\beta}+ \left(\mathbf{X}^{\prime} \mathbf{X}\right)^{-1} \mathbf{X}' \mathbf{u} \ \ \ \Rightarrow

\hat{\boldsymbol{\beta}} \ = \ \boldsymbol{\beta}+\left(\mathbf{X}^{\prime} \mathbf{X}\right)^{-1} \mathbf{X}' \mathbf{u}, \ \text{ or } \tag{2.29}

(\hat{\boldsymbol{\beta}}-\boldsymbol{\beta}) \ = \ \left(\mathbf{X}^{\prime} \mathbf{X}\right)^{-1} \mathbf{X}' \mathbf{u} \tag{2.30}


Proof of \, E({\boldsymbol{\hat \beta}} ) = \boldsymbol{\beta}.

Using representation from Equation 2.29 and using assumption MLR.4 (exogeneity of \mathbf X, i.e., E({\mathbf{u}} \, | \, \mathbf{X}) = 0), we have:

E({\boldsymbol{\hat \beta}} \mid \mathbf{X}) = E \left( \boldsymbol{\beta} + \left(\mathbf{X}^{\prime} \mathbf{X}\right)^{-1} \mathbf{X}' \mathbf{u} \mid \mathbf{X}\right) \ =

\boldsymbol{\beta}+\left(\mathbf{X}^{\prime} \mathbf{X}\right)^{-1} \mathbf{X}^{\prime} \underbrace{E(\mathbf{u} \mid \mathbf{X})}_{0 \text { by assumption }} = \ \boldsymbol{\beta}

\Rightarrow \ \ E ( \hat{\boldsymbol{\beta}} ) \ = \ E_x \, \underbrace{ \left( E(\hat{\boldsymbol{\beta}} \mid \mathbf{X}) \right) }_{\boldsymbol \beta} \ = \ \boldsymbol{\beta}

as the true \boldsymbol{\beta} is a constant and therefore obviously independent of \mathbf{X}.

q.e.d.


Proof of \ E(\hat \sigma^2) = \sigma^2

Given the definition of the orthogonal projection matrix (residual maker matrix) (Equation C.10),

\mathbf M_{_\mathbf X} := \ ( \mathbf I - {\mathbf X(\mathbf X' \mathbf X)^{-1}\mathbf X'})

we have: \, \mathbf {\hat u} = \mathbf M_{_\mathbf X} \mathbf y.

And furthermore, \, \mathbf {\hat u} = \mathbf M_{_\mathbf X} \mathbf u\,:

{\mathbf{\hat u}} = \mathbf M_{_\mathbf X} \mathbf{y} = \mathbf M_{_\mathbf X} (\underbrace{\mathbf X \boldsymbol{\beta}+\mathbf{u}}_{\mathbf{y}}) = \underbrace{\mathbf M_{_\mathbf X} \mathbf{X} \boldsymbol{\beta}}_{\mathbf{0}} + \mathbf M_{_\mathbf X} \mathbf{u} = \mathbf M_{_\mathbf X} \mathbf{u}

Now, Theorem 2.1 says (don’t forget that \, \mathbf M_{_\mathbf X} is symmetric and idempotent):

E(\hat \sigma^2) \ = \ E \left[ \frac{1}{n-k-1} \sum_{i=1}^n \hat u_i^2 \right] \ = \ \left[ \frac{1}{n-k-1} \right] \ E({\mathbf{\hat u}}' {\mathbf{\hat u}}) \ =

\left[ \frac{1}{n-k-1} \right] \ E({\mathbf{u}}' \mathbf M_{_\mathbf X}{\mathbf{u}}) \ = \ \sigma^2

We proceed with the conditional expectation:

E(\hat \sigma^2 \, | \, \mathbf X) \ = \ (n-k-1)^{-1} E({\mathbf{u}}' \mathbf M_{_\mathbf X}{\mathbf{u}} \, | \, \mathbf X) \ =

(n-k-1)^{-1} E( tr ({\mathbf{u}}' \mathbf M_{_\mathbf X}{\mathbf{u}}) \, | \, \mathbf X) \ =

because {\mathbf{u}}' \mathbf M_{_\mathbf X}{\mathbf{u}} is a scalar and therefore equal to its trace.

(n-k-1)^{-1} \, tr \left( E ({\mathbf{u}}' \mathbf M_{_\mathbf X}{\mathbf{u}} \mid \mathbf X) \right) \ =

because both E and tr are linear operators and therefore are exchangeable.

(n-k-1)^{-1} \, tr \left( E ({\mathbf{u}} {\mathbf{u}}' \, \mathbf M_{_\mathbf X} \mid \mathbf X ) \right) \ =

because of Equation B.6.

(n-k-1)^{-1} \, tr \left( \underbrace {E ({\mathbf{u}} {\mathbf{u}}' \mid \mathbf X )}_{\sigma^2 \mathbf I} \, \mathbf M_{_\mathbf X} \right) \ =

because of MLR.5.

(n-k-1)^{-1} \sigma^2 \, tr( \mathbf M_{_\mathbf X} ) \ =

(n-k-1)^{-1} \sigma^2 \, tr( \mathbf I - {\mathbf X(\mathbf X' \mathbf X)^{-1}\mathbf X'} ) \ =

(n-k-1)^{-1} \sigma^2 \, \left ( tr( \mathbf I ) - tr ({\mathbf X(\mathbf X' \mathbf X)^{-1}\mathbf X'} ) \right ) \ =

because of Equation B.9.

(n-k-1)^{-1} \sigma^2 \, \left ( tr( \mathbf I ) - tr ({ \underbrace { \mathbf X' \mathbf X(\mathbf X' \mathbf X)^{-1}}_{\mathbf I_{k+1}} } ) \right ) \ =

because of Equation B.6.

(n-k-1)^{-1} \sigma^2 \, \left ( tr( \mathbf I_n ) - tr ( \mathbf I_{(k+1)} ) \right ) \ =

because of Equation B.9 and the dimension of (\mathbf X' \mathbf X) is (k+1) \times (k+1) and the dimension of \, \mathbf M_{_\mathbf X} is n \times n

(n-k-1)^{-1} \sigma^2 \, \left ( n - (k+1) \right ) \ =

\sigma^2

q.e.d.
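
The trace argument can be checked numerically; a small sketch with model2 from the wage example (n = 526, k = 2):

# tr(M_X) = n - k - 1
X <- model.matrix(model2)
M_X <- diag(nrow(X)) - X %*% solve(t(X) %*% X) %*% t(X)
sum(diag(M_X))    # ~ 523 = 526 - 2 - 1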

  • This means that in “very many” repeated samples the average of the OLS estimates equals the true values

  • As unbiasedness is an average property in repeated samples, in a given particular sample, the estimates may still be far away from the true values

  • The key requirement for that property is the zero conditional mean assumption (MLR.4). Whenever E(u_i \mid x_{1}, x_{2} \ldots , x_{k}) \, \neq \, 0, OLS estimates are biased, i.e., on average, they are false

  • The additional key requirement for the unbiasedness of \hat \sigma^2 is MLR.5


2.8.2 Variance of OLS

Theorem 2.2 (Variance of OLS) \text{ }

Under MLR.1 to MLR.5 the variance of the OLS estimates is:

{Var(\hat \beta_j \mid x_{1}, \ldots , x_{k})} \ = \ \dfrac{\sigma^2}{ \underbrace{SST_{j}}_{\sum_{i=1}^n (x_{i,j} - \bar x_j)^2} (1-R_j^2) } \ = \ \frac {1}{n} \dfrac {\sigma^2}{s_{x_j}^2} \dfrac {1}{(1-R_j^2)} \tag{2.31}

The proof of Equation 2.31 for a slope parameter is quite straightforward, although a little tedious.
From Theorem 1.2 we know that for the simple regression model we have:

\operatorname{Var}(\hat \beta_1|x_1, \ldots, x_n) \ = \ \dfrac {\sigma^2}{SST_x} \tag{2.32}

From Equation 2.17 and Equation 2.16 respectively, and the discussion of the FWL theorem, we know that a multiple regression model can be reduced to a simple one for every particular parameter we want to estimate, for instance for \beta_j.

Hence, the simple variance formula Equation 2.32 also applies to any parameter j of a multiple regression.

However, the x-variable in regression Equation 2.17 (which was \hat r_1) itself consists of the residuals of a first stage regression of x_j on [1, x_1, \ldots, x_{j-1}, x_{j+1}, \ldots, x_k] and therefore SST_x from Equation 2.32 above is equal to the SSR_j of that first stage regression.

SST_x \ = \ \sum_{i=1}^n (\hat r_i - \bar {\hat r})^2 \ = \ SSR_j

Furthermore, we know from Equation 1.14 that SST_j = SSE_j + SSR_j, which implies:

SSR_j = SST_j - SSE_j \tag{2.33}

whereby SST_j = \sum_{i=1}^n (x_{i,j} - \bar x_j)^2.

From the definition of R^2, Equation 1.15, we have for the R^2_j of the first stage regression, R^2_j = \frac {SSE_j}{SST_j}, which implies:

SSE_j = SST_j \cdot R^2_j

Inserting this into Equation 2.33 yields:

SSR_j \ = \ SST_j - SST_j \cdot R^2_j \ = \ SST_j(1-R^2_j)

So, we can substitute SST_j(1-R^2_j) for SST_x in Equation 2.32 and arrive at Equation 2.31.

q.e.d.


{Var(\hat \beta_j \mid x_{1}, x_{2}, \ldots , x_{k})} from Equation 2.31 is the j^{th} diagonal element of the covariance matrix of \boldsymbol {\hat \beta}. So, let's calculate that covariance matrix.

The covariance matrix of {\mathbf {\hat \beta}}

Under the assumption MLR.5

\operatorname{Var}(\mathbf u \mid \mathbf X) = E (\mathbf{u u}' \mid \mathbf{X}) = \sigma^2 \mathbf I \, ,

unbiasedness of \, \boldsymbol {\hat \beta}, Equation C.4 and the definition of a covariance matrix, Equation C.6, we have:

\operatorname{Var}(\boldsymbol{\hat \beta} \mid \mathbf{X}) := \ E\left[(\boldsymbol{\hat \beta}-E(\boldsymbol{\hat \beta}))(\boldsymbol{\hat \beta}-E(\boldsymbol{\hat \beta}))^{\prime} \mid \mathbf{X}\right]\ =

E\left[(\boldsymbol{\hat \beta}-\boldsymbol{\beta})(\boldsymbol{\hat \beta}-\boldsymbol{ \beta})^{\prime} \mid \mathbf{X}\right] \ =

E [ (\underbrace{(\mathbf{X}' \mathbf{X})^{-1} \mathbf{X}^{\prime} \mathbf{u}}_{(\hat{\boldsymbol\beta}-\boldsymbol\beta)} )(\underbrace{(\mathbf{X}' \mathbf{X})^{-1} \mathbf{X}' \mathbf{u}}_{(\hat{\boldsymbol\beta}-\boldsymbol\beta)})^{\prime} \mid \mathbf{X}] \ =

E\left[ (\mathbf{X}' \mathbf{X})^{-1} \mathbf{X}' \mathbf{u} \mathbf{u}^{\prime} \mathbf{X}(\mathbf{X}' \mathbf{X})^{-1} \mid \mathbf{X} \right] \ =

\left(\mathbf{X}^{\prime} \mathbf{X}\right)^{-1} \mathbf{X}^{\prime} \underbrace{ \ E\left(\mathbf{u u}' \mid \mathbf{X}\right)}_{\sigma^{2} \mathbf I} \ \mathbf{X}(\mathbf{X}' \mathbf{X})^{-1}

\Rightarrow \ \ \operatorname{Var}( \boldsymbol{\hat\beta} \mid \mathbf X) \ = \ \sigma^{2}(\mathbf{X}' \mathbf{X})^{-1} \tag{2.34}


Alternatively, applying Equation C.3 and the variance formula Equation C.7, we directly get the same result:

\operatorname{Var}(\hat{\boldsymbol{\beta}} \mid \mathbf{X}) \ = \ \operatorname{Var} [ \underbrace{ \boldsymbol{\beta} +\left(\mathbf{X}^{\prime} \mathbf{X}\right)^{-1} \mathbf{X}^{\prime} \mathbf{u}}_{ \hat{\boldsymbol{\beta}}} \mid \mathbf{X}] \ =

\underbrace{ \left(\mathbf{X}^{\prime} \mathbf{X}\right)^{-1} \mathbf{X}^{\prime}}_{\mathbf A} \ \underbrace{ \operatorname{Var}(\mathbf{u} \mid \mathbf{X})}_{\mathbf \Sigma} \ \underbrace{\mathbf{X}\left(\mathbf{X}^{\prime} \mathbf{X}\right)^{-1}}_{\mathbf A'} \ =

\left(\mathbf{X}^{\prime} \mathbf{X}\right)^{-1} \mathbf{X}^{\prime} \ \sigma^{2} \mathbf{I} \ \mathbf{X}\left(\mathbf{X}^{\prime} \mathbf{X}\right)^{-1} \ =

\sigma^{2}\left(\mathbf{X}^{\prime} \mathbf{X}\right)^{-1}
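Equation 2.34 is easy to verify in R. The following minimal sketch (simulated data, purely illustrative) computes \hat \sigma^2 (\mathbf X' \mathbf X)^{-1} by hand and compares it with the covariance matrix that vcov() reports for an lm() fit:

Code
# comparing sigma^2_hat * (X'X)^{-1} with vcov() of an lm() fit; simulated data
set.seed(1)
n  <- 200
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + x1 + x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)

X      <- cbind(1, x1, x2)                     # design matrix incl. intercept
sig2   <- sum(resid(fit)^2) / (n - 3)          # sigma^2_hat = SSR/(n-k-1), Equation 2.35
V_hand <- sig2 * solve(t(X) %*% X)             # Equation 2.34 with sigma^2_hat plugged in
all.equal(unname(V_hand), unname(vcov(fit)))   # should be TRUE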

  • Besides MLR.4, the key requirement for this theorem is MLR.5, homoscedasticity. If this assumption is violated, i.e., the variances of u_i depend on the explanatory variables (the x-es), the formula above is simply false (for this case, there exists a modified formula)

Factors determining \operatorname {Var}(\hat \beta_j \, | \, \mathbf x)

According to Equation 2.31, the variance of \hat \beta_j

  • increases with \sigma^2, the variance of the error term

    • a high error variance increases the sampling variance of \hat \beta_j because there is more “noise” in the equation; this necessarily makes estimates more imprecise

    • Remark: The error variance itself does not decrease with sample size

  • decreases with SST_{j}, the total sum of squared deviations \sum_{i=1}^n (x_{i,j}-\bar x_j)^2

    • generally, more sample variation in the x-es leads to more precise estimates

    • SST_{j} automatically increases with the sample size n. Increasing the sample size is thus a way to get more precise estimates

      • These two factors are identical to the term \frac {1}{n} \frac {\sigma^2}{s_{x_{(j)}}^2} we already had with the simple regression model – compare Theorem 1.2.
  • increases with R_j^2, the R-squared of a regression of x_j on all the other right hand variables. This is the new aspect of a multiple regression model

    • In particular, this is a measure of the degree of multicollinearity between the x-es. The higher the correlation of x_j with the other x-es, the lower the independent variation of x_j and the less additional information is contained in the variations of x_j

    • Multicollinearity may be measured by the variance inflation factor: VIF_j := 1 / (1-R_j^2). As a (quite arbitrary) rule of thumb, the variance inflation factor should not be larger than 10 (a small computational sketch follows below)

Thus, the variance of \hat \beta_j in a multiple regression model corresponds to the variance of the simple regression model times the variance inflation factor, see Equation 2.31
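A minimal sketch of how the variance inflation factor can be computed by hand (simulated, correlated regressors; values purely illustrative): regress x_j on the remaining regressors and plug the resulting R_j^2 into 1/(1-R_j^2).

Code
# manual VIF: regress x_j on the other regressors, VIF_j = 1/(1 - R_j^2)
set.seed(1)
n  <- 500
z  <- rnorm(n)
x1 <- z + rnorm(n); x2 <- z + rnorm(n); x3 <- rnorm(n)  # x1 and x2 are correlated

R2_1  <- summary(lm(x1 ~ x2 + x3))$r.squared   # R_j^2 of the auxiliary regression
VIF_1 <- 1 / (1 - R2_1)
VIF_1                                          # far below the rule-of-thumb threshold of 10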


Variance estimation

According to Theorem 2.1, an unbiased estimator of \sigma^2 is given by

\hat \sigma^2 \ = \ \frac{1}{n-k-1} \sum_{i=1}^n \hat u_i^2 \ = \ \dfrac{SSR}{df} \tag{2.35}

Given Equation 2.31, an unbiased estimator for the variance of \hat \beta_j is

\widehat{Var(\hat \beta_j)} \ = \ \dfrac{\hat \sigma^2}{SST_{j} (1-R_j^2)} \tag{2.36}

And for the standard error of \hat \beta_j

se(\hat \beta_j) = \sqrt{\widehat{Var(\hat \beta_j)}} \ = \ \dfrac{\hat \sigma}{\sqrt{SST_{j} (1-R_j^2)}} \ = \ \dfrac{ \overbrace{S.\! E.}^{\hat \sigma \ :=}}{\sqrt{SST_{j} (1-R_j^2)}} \tag{2.37}
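Equation 2.36 and Equation 2.37 can be reproduced by hand from an lm() fit. The following minimal sketch (simulated data, purely illustrative) computes se(\hat \beta_1) from \hat \sigma, SST_1 and R_1^2 and compares it with the standard error reported by summary():

Code
# reproducing se(beta_1_hat) = sigma_hat / sqrt(SST_1 (1 - R_1^2)) by hand
set.seed(1)
n  <- 300
x1 <- rnorm(n); x2 <- 0.5*x1 + rnorm(n)        # some multicollinearity between x1 and x2
y  <- 1 + x1 + x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)

sig_hat <- sqrt(sum(resid(fit)^2) / (n - 3))   # sigma_hat, Equation 2.35
SST1    <- sum((x1 - mean(x1))^2)              # total sum of squares of x1
R2_1    <- summary(lm(x1 ~ x2))$r.squared      # R^2 of the first stage regression of x1
se_hand <- sig_hat / sqrt(SST1 * (1 - R2_1))   # Equation 2.37

c(by_hand = se_hand, from_lm = summary(fit)$coefficients["x1", "Std. Error"])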


  • We assume the true model to be: \, y = 1 + \beta_1 x_1 + \beta_2 x_2 + u

    • We assume the true parameters to be: \, \beta_1 = 1 \, and \, \beta_2 = 1

    • We generate random vectors for x_1, x_2 and u with 100 observations each and calculate y according to the model above

    • Thereby, the standard deviation of u is 1, of x_1 is 2 and of x_2 is 4

    • The variables x_1 and x_2 are not correlated; hence, there is no multicollinearity

  • We estimate the model above with the generated variables y, x_1 and x_2 by OLS (with lm()) and store the values of \hat \beta_1 and \hat \beta_2. This estimation procedure is repeated 10000 times

    • Finally, we calculate the means and standard deviations of the stored 10000 \hat \beta_1 and \hat \beta_2

    • Thus, we expect

      • mean(\hat \beta_1) = 1
      • mean(\hat \beta_2) = 1
    • And furthermore, according to Equation 2.31, we expect (note, SST_j \approx (n-k-1) \times \sigma_j^2)

      • se(\hat \beta_1) = \sqrt{(\frac {1}{97 \times 2^2})} = 0.0508
      • se(\hat \beta_2) = \sqrt{(\frac {1}{97 \times 4^2})} = 0.0254
Code
n=100        # number of observations
repli=10000  # number of replications
distri=1     # which distribution for u is used: 1 = normal distribution 
  
B1 = rep(NA, repli)  # initialize B1 for storing estimated b1
B2 = rep(NA, repli)  # initialize B2 for storing estimated b2

for (i in 1:repli) { 
  # u is normal distributed
   if (distri == 1) {
      u  = rnorm(n, mean = 0, sd = 1) # random normal vector, with E(u)=0
      x1 = rnorm(n, mean = 1, sd = 2) # random normal vector, with E(x1)=1
      x2 = rnorm(n, mean = 2, sd = 4) # random normal vector, with E(x2)=2
      y = 1 + x1 + x2 + u
   }
  # u is uniformly distributed
   if (distri == 2) {
      u = runif(n, min = -2, max = 2) # random uniform vector, with E(u)=0
      x1 = rnorm(n, mean = 1, sd = 2) # random normal vector, with E(x1)=1
      x2 = rnorm(n, mean = 2, sd = 4) # random normal vector, with E(x2)=2
      y = 1 + x1 + x2 + u
   }
  
   model <- lm(y ~ x1 + x2)
   b = coef(model)
  
   B1[i] = b[2]  # storing estimated b1
   B2[i] = b[3]  # storing estimated b2
}
# end loop

# results
mB1 = round(mean(B1), digits = 4)
mB2 = round(mean(B2), digits = 4)

sd1 = round(sd(B1), digits = 4)
sd2 = round(sd(B2), digits = 4)

# theoretical se
sd1th = sqrt( 1 / ( (n-3)*2^2 ) )
sd2th = sqrt( 1 / ( (n-3)*4^2 ) )

hist(B1, breaks = 25, freq = FALSE, xlim=c(0.8, 1.2), 
     main = paste0( "mean(beta1) = ", mB1, ",   sd(beta1) = ", sd1) )
curve(dnorm(x, mean = 1 ,sd=sd1th), add = TRUE, col = "red") 


hist(B2, breaks = 25, freq = FALSE, xlim=c(0.8, 1.2),
     main = paste0( "mean(beta2) = ", mB2, ",   sd(beta2) = ", sd2) )
curve(dnorm(x, mean = 1 ,sd=sd2th), add = TRUE, col = "red") 

(a) Theoretical and sample distribution of \hat \beta_1. Standard deviation of x_1 is 2

(b) Theoretical and sample distribution of \hat \beta_2. Standard deviation of x_2 is 4

Figure 2.2: Distribution of estimated regression parameters, n = 100, repli = 10000


2.8.3 The Gauss-Markov Theorem

Theorem 2.3 (Gauss-Markov Theorem) \text{ }

Under assumptions MLR.1 - MLR.5, the OLS estimators are the best linear unbiased estimators (BLUEs) of the regression coefficients, i.e.

Var(\hat \beta_j) \ \leq \ Var( \tilde \beta_j), \ \ \forall \ \, \tilde \beta_j = \sum_{i=1}^n c_{ij}y_i \ \text{ for which } \ E(\tilde \beta_j) = \beta_j

Our model once again:

\mathbf y = \mathbf X \boldsymbol \beta + \mathbf u

The OLS estimator is:

\hat{\boldsymbol{\beta}} \ = \ \underbrace {(\mathbf{X}' \mathbf{X})^{-1} \mathbf{X}'}_{:= \mathbf A} \mathbf{y} \ = \ \mathbf A \mathbf{y}

Suppose we have another linear and unbiased estimator \tilde{\boldsymbol{\beta}}:

\tilde{\boldsymbol{\beta}} \ = \ \mathbf C \mathbf{y} \ \ \Rightarrow

\tilde{\boldsymbol{\beta}} \ = \ \mathbf C \underbrace {(\mathbf X \boldsymbol \beta + \mathbf u)}_{\mathbf y} \ = \ \mathbf C \mathbf X \boldsymbol \beta + \mathbf C \mathbf u \tag{2.38}

\tilde{\boldsymbol{\beta}} should be unbiased (using MLR.4):

E(\tilde{\boldsymbol{\beta}} \mid \mathbf X) \ = \ E(\mathbf C \mathbf X \boldsymbol \beta + \mathbf C \mathbf u \mid \mathbf X) \ = \ \mathbf C \mathbf X \, {\boldsymbol{\beta}} + \mathbf C \, \underbrace {E(\mathbf u \mid \mathbf X)}_{\mathbf 0} \ = \ \mathbf C \mathbf X \, \boldsymbol{\beta}

Hence, for unbiasedness of \tilde{\boldsymbol{\beta}} we must have:

\mathbf C \mathbf X = \mathbf I \tag{2.39}

Furthermore, from Equation 2.38, Equation 2.39 and Equation C.7 (using MLR.5) we have

\operatorname {Var}(\tilde{\boldsymbol{\beta}}) \ = \ \sigma^2 \mathbf C \mathbf C'

Define the difference of the covariance matrices:

\mathbf D \ = \ [\operatorname {Var}(\tilde{\boldsymbol{\beta}}) - \operatorname {Var}(\hat {\boldsymbol{\beta}})] \ = \ \sigma^2 [\mathbf C \mathbf C' - (\mathbf{X}' \mathbf{X})^{-1}]

But this equals to:

\mathbf D \ = \ \sigma^2 \mathbf C[\underbrace {\mathbf I - \mathbf X(\mathbf{X}' \mathbf{X})^{-1}\mathbf X'}_{\mathbf M}]\mathbf C'

because of Equation 2.39: \, \mathbf C \mathbf X = \mathbf I = \mathbf X' \mathbf C', rule Equation B.3.

However, the middle part of \mathbf D is equal to the orthogonal projection matrix \mathbf M, see Equation C.10.

  • We know that \mathbf M is idempotent and therefore positive semi definite, p.s.d. (see Appendix B and C)

  • Thus, the covariance difference matrix \mathbf D is p.s.d. as well (Appendix B), which means that every element of the main diagonal is \geq 0.

Hence, \operatorname {Var} (\tilde \beta_j) \geq \operatorname {Var} (\hat \beta_j), \ \forall j.

q.e.d.


This means that within the class of unbiased and linear (in y_i) estimators, OLS is the one with the lowest variance (with the most precision). We say, OLS is efficient.

  • The limitation to estimators linear in y_i is somewhat arbitrary. However, we have two important enhancements:

    • If we are willing to assume normally distributed errors, MLR.6, OLS is equivalent to Maximum Likelihood estimation (MLE), which is generally asymptotically efficient in the class of consistent and normally distributed estimators

    • An even stronger enhancement is Theorem 4.5 in Hansen (2022). This shows that the restriction to linear estimators in Theorem 2.3 can be completely dropped. The proof of this extended version of the Gauss-Markov Theorem is far beyond the scope of this book, see Hansen (2022), Section 4.26
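To illustrate Theorem 2.3 with a small simulation of my own (not from the text), we can compare OLS with another linear unbiased estimator, for instance a weighted least squares estimator \tilde{\boldsymbol\beta} = (\mathbf X' \mathbf W \mathbf X)^{-1}\mathbf X' \mathbf W \mathbf y with arbitrary fixed positive weights. It satisfies \mathbf C \mathbf X = \mathbf I and is therefore unbiased, but under homoskedasticity its sampling variance should exceed that of OLS:

Code
# Gauss-Markov illustration: OLS vs. an arbitrary linear unbiased estimator (WLS with "wrong" weights)
set.seed(1)
n <- 100; repli <- 5000
x <- rnorm(n); X <- cbind(1, x)            # fixed regressors across replications
W <- diag(runif(n, 0.2, 5))                # arbitrary fixed positive weights

b_ols <- b_wls <- rep(NA, repli)
for (i in 1:repli) {
  y <- 1 + x + rnorm(n)                                      # homoskedastic errors, true slope = 1
  b_ols[i] <- solve(t(X) %*% X, t(X) %*% y)[2]               # OLS slope
  b_wls[i] <- solve(t(X) %*% W %*% X, t(X) %*% W %*% y)[2]   # WLS slope
}
c(mean_ols = mean(b_ols), mean_wls = mean(b_wls))  # both approx. 1: unbiased
c(sd_ols = sd(b_ols), sd_wls = sd(b_wls))          # sd_wls typically exceeds sd_ols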


2.9 Misspecifications

2.9.1 Irrelevant variables

Consider the true model

y \ = \ \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u

But we estimate

y \ = \ \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \textcolor {red} {\beta_3 x_3} + v

  • This case, which is called overspecification, poses no problem regarding the unbiasedness, because MLR.4 is not violated through the inclusion of irrelevant variables

  • Hence, OLS estimates are still unbiased and deliver on average the true values

  • In particular, as \beta_3 = 0, we will get \hat \beta_3 = 0 on average

  • However, inclusion of irrelevant variables leads to less precise estimates, as irrelevant variables are often correlated with the relevant ones in actual samples

    • So, the variance of \hat \beta_j increases because of a more pronounced multicollinearity problem – R_j^2 increases in Equation 2.31 (a small simulation sketch follows below)
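A small simulation sketch of my own (not from the text) illustrates both points: including an irrelevant regressor leaves the slope estimate unbiased, but when the irrelevant variable is correlated with the relevant one, the variance of the estimate increases:

Code
# overspecification: x2 is irrelevant (beta_2 = 0) but correlated with x1
set.seed(1)
n <- 100; repli <- 5000
b1_true <- b1_over <- rep(NA, repli)
for (i in 1:repli) {
  x1 <- rnorm(n)
  x2 <- 0.9*x1 + rnorm(n, sd = 0.5)          # irrelevant but strongly correlated with x1
  y  <- 1 + x1 + rnorm(n)                    # true model contains only x1
  b1_true[i] <- coef(lm(y ~ x1))[2]          # correctly specified
  b1_over[i] <- coef(lm(y ~ x1 + x2))[2]     # overspecified
}
c(mean_true = mean(b1_true), mean_over = mean(b1_over))  # both approx. 1: unbiased
c(sd_true = sd(b1_true), sd_over = sd(b1_over))          # sd_over is clearly larger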

2.9.2 Omitted variables

Consider the true model

y \ = \ \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \textcolor {green} {\beta_3 x_3} + u

However, we estimate the following misspecified model, omitting the relevant variable x_3 and denoting the coefficients with b instead of \beta

y \ = \ b_0 \, + \, b_1 x_1 \, + \, b_2 x_2 \, + \underbrace {(\textcolor {green} {\beta_3 x_3} + u)}_{v}

  • The new error term v now includes the omitted variable: v = (\beta_3 x_3 + u)

  • Therefore, if the omitted variable x_3 is correlated with one or both of the other variables x_1, x_2, as is often the case, assumption MLR.4 is violated, because the new error term v is then correlated via x_3 with the remaining variables x_1, x_2

  • Hence, we would expect: E(\hat b_j) \neq \beta_j, \ \, j=0,1,2

  • This case is called underspecification


Omitted Variable Bias

We assume that x_3 is related to the other x-es by

x_3 = \delta_0 + \delta_1 x_1 + \delta_2 x_2 + e \tag{2.40}

Plugging this into the true model and collecting terms we get:

y \ = \ \underbrace{(\beta_0 + \textcolor {green} {\beta_3 \delta_0})}_{b_0} + \underbrace{(\beta_1 + \textcolor {green}{\beta_3 \delta_1})}_{b_1} x_1 + \underbrace{(\beta_2 + \textcolor {green}{\beta_3 \delta_2})}_{b_2} x_2 + \underbrace{(u + \textcolor {green}{\beta_3 e})}_v \tag{2.41}

  • So, estimating this underspecified model, the estimated coefficients b_j capture both

    • the direct (true partial) effect of x_j on y, represented by \beta_j

    • an effect of x_3 on y, which is falsely attributed to x_j because of the correlation between x_j and x_3. The extent of this bias is \beta_3 \delta_j

  • Hence, the Omitted Variable Bias depends on:

    • \beta_3, the effect of the omitted variable on y. If \beta_3=0, x_3 is actually irrelevant \ \rightarrow no bias

    • \delta_j, the coefficient of x_j in the regression of the omitted variable x_3 on the included regressors, Equation 2.40. If the omitted variable is not (partially) correlated with the remaining explanatory variables, \delta_1 = \delta_2 = 0 \ \rightarrow no bias in the slope coefficients


Under– versus overspecification

  • The omitted variable bias is one of the most important problems in applied econometrics. Often, a lack of data for important variables is the reason, e.g. ability in the wage equation

  • Sometimes, however, it is unclear whether a particular variable should be included in the model

    • If the variable is relevant and it is not included, an omitted variable bias emerges

    • If the variable is irrelevant and included, the variance of the estimates increases

  • If there is a severe multicollinearity problem, it could be better to exclude the variable, even if the variable is relevant. Two effects are then at work

    • The variance of the error term v in Equation 2.41 is larger than the variance of the error term u in the true model, \sigma_v^2 = \beta_3^2 \sigma_e^2 + \sigma^2. As the variance formula in Equation 2.31 shows, this increases the variance of the OLS estimates

    • But on the other hand, omitting a variable reduces the multicollinearity problem – R_j^2 in the variance formula of Equation 2.31 decreases and so does the variance of the OLS estimates

    • If the multicollinearity problem was really severe, the latter effect could dominate and hence omitting a relevant variable could be beneficial, especially in small samples with many regressors

  • But even if the latter effect dominates and the variance decreases, this net effect has to outweigh the bias (the MSE is the relevant criterion here). The following example illustrates the issue


Example – omitted variables

We simulate a model with three explanatory variables and \beta_0 = \beta_1 = \beta_2 = \beta_3 = 1

y \ = \ 1 + x_1 + x_2 + x_3 + u

The x_j are mutually related through three error processes, e_1, e_2, e_3 and an additional noise process \eta_j:

x_1 = 0.4 e_1 + 0.3 e_2 + 0.3 e_3 + \eta_1, \quad x_2 = 0.3 e_1 + 0.4 e_2 + 0.3 e_3 + \eta_2

x_3 = 0.3 e_1 + 0.3 e_2 + 0.4 e_3 + \eta_3

(e_1, e_2, e_3, u) are N(0,1) distributed random vectors of sample size n = 1,000

We investigate what happens if x_3 is omitted. According to Equation 2.41, we expect that \hat b_j = \hat \beta_j + \hat \beta_3 \hat \delta_j, \ j=0, 1, 2


Simulating data, low multicollinearity
Code
library(AER)
library(modelsummary) # for pretty output 

# generating variables x1-x3, y = x1 + x2 + x3 + u 

n  = 1000  # sample size
e1 = rnorm(n); e2 = rnorm(n); e3 = rnorm(n);
x1 = 0.4*e1 + 0.3*e2 + 0.3*e3 + rnorm(n,0,0.5);
x2 = 0.3*e1 + 0.4*e2 + 0.3*e3 + rnorm(n,0,0.5); 
x3 = 0.3*e1 + 0.3*e2 + 0.4*e3 + rnorm(n,0,0.5);
u  = rnorm(n)
y = 1 + x1 + x2 + x3 + u

Regressing the true model
Code
out1 <- lm(y ~ x1 + x2 + x3)
sout1 <- summary(out1) 

modelsummary(list("True Model"=out1), shape =  term ~ statistic ,
             statistic = c('std.error', 'statistic', 'p.value', 'conf.int'), 
             stars = TRUE, 
             gof_omit = "A|L|B|F",
             align = "ldddddd",
             output = "gt")
True Model
Est. S.E. t p 2.5 % 97.5 %
(Intercept)    0.949***  0.032  29.634  <0.001  0.886  1.012
x1    0.879***  0.054  16.207  <0.001  0.773  0.985
x2    1.084***  0.051  21.099  <0.001  0.983  1.185
x3    1.042***  0.055  18.943  <0.001  0.934  1.149
Num.Obs. 1000      
R2    0.786   
RMSE    1.01    
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

Regressing the underspecified model
Code
out2 <- lm(y ~ x1 + x2)
sout2 <- summary(out2)

modelsummary(list("Underspecified Model"=out2), shape =  term ~ statistic ,
             statistic = c('std.error', 'statistic', 'p.value', 'conf.int'), 
             stars = TRUE, 
             gof_omit = "A|L|B|F",
             align = "ldddddd",
             output = "gt")
Underspecified Model
Est. S.E. t p 2.5 % 97.5 %
(Intercept)    0.925***  0.037  24.795  <0.001  0.852  0.998
x1    1.306***  0.058  22.708  <0.001  1.193  1.419
x2    1.414***  0.056  25.083  <0.001  1.303  1.525
Num.Obs. 1000      
R2    0.709   
RMSE    1.17    
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

Auxiliary model, Equation 2.40: Regressing x3 on x1 + x2
  • The R2 is a measure for multicollinearity. Here, there is a moderate degree of multicollinearity
Code
# auxiliary model: regressing x3 as functions of x1 + x2, large noise, low multicollinearity
aux <- lm(x3 ~ x1 + x2)
saux <- summary(aux) 

modelsummary(list("Auxiliary Regression of x3"=aux), shape =  term ~ statistic ,
             statistic = c('std.error', 'statistic', 'p.value', 'conf.int'), 
             stars = TRUE, 
             gof_omit = "A|L|B|F",
             align = "ldddddd",
             output = "gt")
Auxiliary Regression of x3
Est. S.E. t p 2.5 % 97.5 %
(Intercept)   -0.023     0.018  -1.256   0.209  -0.059  0.013
x1    0.410***  0.028  14.424  <0.001   0.354  0.466
x2    0.316***  0.028  11.362  <0.001   0.262  0.371
Num.Obs. 1000      
R2    0.416   
RMSE    0.58    
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

Calculating the coefficients of the underspecified model according to Equation 2.41
Code
# estimated coefficients of true and underspecified model
coef_t <- out1$coefficients;  coef_f <- out2$coefficients

# calculated coefficients according text above
# coefficients true model + coefficient of omitted variable x3 * coefficients of auxiliary model 
coef_cal <- out1$coefficients[1:3] + out1$coefficients["x3"] * aux$coefficients

# standard errors of true and underspecified model
se_t <- sout1$coefficients[1:3,2];  se_f <- sout2$coefficients[1:3,2]; 

# output matrix for table
tab = cbind(coef_t[1:3], coef_f, coef_cal, se_t, se_f)
colnames(tab) = c("True_Model", "Missp_Model", 
                  "Calc_Coeffs", "se_True_Model","se_Missp_Model")
vars <- rownames(tab)

# printing tab
# tab

Summarizing results
  • We have an enormous omitted variable bias

  • Note that the calculated coefficients (according to Equation 2.41) for the underspecified model exactly match the estimated coefficients for the underspecified model

  • Furthermore, the estimated standard errors are somewhat higher for the underspecified model

Code
library(gt)
# `gt()` needs a data frame as argument and "tab" is a matrix. 
#  We are rounding to 3 digits
#  The names of the variables are stored in the vector "vars" 

gt(data.frame(vars ,round(tab, 3)))
Table 2.2:

True and underspecified models and calculated coefficients, low multicollinearity

vars True_Model Missp_Model Calc_Coeffs se_True_Model se_Missp_Model
(Intercept) 0.949 0.925 0.925 0.032 0.037
x1 0.879 1.306 1.306 0.054 0.058
x2 1.084 1.414 1.414 0.051 0.056

Simulating data, high multicollinearity
Code
# generating variables x1-x3, which are correlated and y = x1 + x2 + x3 + u
# variance of noise process eta much lower => higher multicollinearity

n  = 1000  # sample size
e1 = rnorm(n); e2 = rnorm(n); e3 = rnorm(n);
x1 = 0.4*e1 + 0.3*e2 + 0.3*e3 + rnorm(n,0,0.05);
x2 = 0.3*e1 + 0.4*e2 + 0.3*e3 + rnorm(n,0,0.05); 
x3 = 0.3*e1 + 0.3*e2 + 0.4*e3 + rnorm(n,0,0.05);
u  = rnorm(n)
y = 1 + x1 + x2 + x3 + u
Code
# regressing the true model
out11 <- lm(y ~ x1 + x2 + x3)
sout11 <- summary(out11) 
Code
# regressing the underspecified model (omitting x3)
out22 <- lm(y ~ x1 + x2)
sout22 <- summary(out22) 
Auxiliary model, Equation 2.40: Regressing x3 on x1 + x2; much better fit than before
Code
# auxiliary model: regressing x3 as functions of x1 + x2, small noise: sigma of eta 1/10 of previous case 

aux <- lm(x3 ~ x1 + x2)
saux <- summary(aux) 

modelsummary(list("Auxiliary Regression of x3"=aux), shape =  term ~ statistic ,
             statistic = c('std.error', 'statistic', 'p.value', 'conf.int'), 
             stars = TRUE, 
             gof_omit = "A|L|B|F",
             align = "ldddddd",
             output = "gt")
Auxiliary Regression of x3
Est. S.E. t p 2.5 % 97.5 %
(Intercept)   -0.002     0.005  -0.331   0.741  -0.010  0.007
x1    0.476***  0.028  16.793  <0.001   0.420  0.532
x2    0.506***  0.028  17.888  <0.001   0.451  0.562
Num.Obs. 1000      
R2    0.937   
RMSE    0.14    
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
Code
# estimated coefficients of true and underspecified model
coef_t <- out11$coefficients;  coef_f <- out22$coefficients

# calculated coefficients according text above
# coefficients true model + coefficient of omitted variable x3 * coefficients of auxiliary model 
coef_cal <- out11$coefficients[1:3] + out11$coefficients["x3"] * aux$coefficients

# standard errors of true and underspecified model
se_t <- sout11$coefficients[1:3,2];  se_f <- sout22$coefficients[1:3,2]; 

# output matrix for table
tab = cbind(coef_t[1:3], coef_f, coef_cal, se_t, se_f)
colnames(tab) = c("True_Model", "Missp_Model", 
                  "Calc_Coeffs", "se_True_Model","se_Missp_Model")

# printing tab
# tab

Summarizing the results of both simulations
Code
lowMultiColl = list( "TrueMod"=out1, "MisspMod"=out2)
highMultiColl = list( "TrueMod"=out11, "MisspMod"=out22 )

m1 <- modelsummary(lowMultiColl,
             stars = TRUE,
             gof_omit = "A|L|B|F",
             align = "ldd",
             output = "flextable")

m2 <- modelsummary(highMultiColl,
             stars = TRUE,
             gof_omit = "A|L|B|F",
             align = "ldd",
             output = "flextable")

m1; m2;

Table 2.3: True and underspecified models for low and high multicollinearity

(a) Low Multicollinearity

            TrueMod     MisspMod
(Intercept)  0.949***    0.925***
            (0.032)     (0.037)
x1           0.879***    1.306***
            (0.054)     (0.058)
x2           1.084***    1.414***
            (0.051)     (0.056)
x3           1.042***
            (0.055)
Num.Obs.  1000        1000
R2           0.786       0.709
RMSE         1.01        1.17
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

(b) High Multicollinearity

            TrueMod     MisspMod
(Intercept)  0.991***    0.989***
            (0.032)     (0.033)
x1           0.257       0.903***
            (0.227)     (0.204)
x2           1.294***    1.982***
            (0.230)     (0.203)
x3           1.357***
            (0.223)
Num.Obs.  1000        1000
R2           0.725       0.714
RMSE         1.01        1.03
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

  • For the high multicollinearity case, we have much less precise estimates even for the true model (and we have 1000 observations!). However, the deviations from the true values are not statistically significant

    • Note that the increase of the standard errors of the true model due to a rise of multicollinearity is totally in accordance with the variance formula of Equation 2.31. You can calculate the expected rise of the standard errors using the R2 of the two auxiliary equations. The expected and actual increase is in the range of 3.2 to 4 times
  • For the high multicollinearity case, we have an enormous omitted variable bias as well, which is statistically significant for the variable x_2

  • As before, the calculated coefficients (according to Equation 2.41) for the underspecified model exactly match the estimated coefficients for the underspecified model (not shown in the table)

  • Furthermore, the estimated standard errors are now lower for the underspecified model (as expected), but the bias dominates this effect by far in our simulation experiment


2.9.3 Errors in variables

  • Measurement errors in the model variables are a very common problem in econometrics

  • Surprisingly, the consequences of this issue depend on whether the dependent or explanatory variables are affected by this problem

Measurement errors in the dependent variable

  • We assume that the true value of y is observed only with a measurement error

    y^O = y + e \tag{2.42}

with y^O being the actually observed value of y and e being the measurement error. Substituting y = y^O - e into Equation 2.1 and rearranging yields

y^O = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + ( \underbrace {u+e}_v ) \tag{2.43}

  • The only consequence is that now we have a new error term v=(u+e), with variance \sigma_v^2 = \sigma^2+\sigma_e^2, if we assume that the measurement error is uncorrelated with the error term u

    • Hence, a measurement error in y leads to less precise estimates of all \beta_j, i.e., larger variances of \hat \beta_j, which is not very surprising

Measurement errors in an explanatory variable

  • We assume that the true value of x_j is only observed with a measurement error

    x_j^O = x_j + e_j \tag{2.44}

    with x_j^O being the actually observed value of x_j and e_j being the associated measurement error

  • Plugging in for x_j, our base model Equation 2.1 becomes

y = \beta_0 + \beta_1 x_1 + \cdots + \beta_j x_j^O + \cdots + \beta_k x_k + ( \underbrace{u - \beta_j e_j}_v ) \tag{2.45}

  • As a consequence, we have a new error term v=(u-\beta_je_j), with variance \sigma_v^2 = \sigma^2 + \beta_j^2 \sigma_{e_j}^2, if we assume that the measurement error is uncorrelated with the error term u. Hence, as before, a measurement error in x_j leads to less precise estimates of all \beta_j

  • However, there is an additional complication: Because of Equation 2.44, the observed model variable x_j^O is correlated with e_j, leading directly to a correlation of the new error term v with x_j^O and consequently to a violation of the conditional mean independence assumption MLR.4. The result is biased estimates \boldsymbol {\hat \beta}


Summarizing:

  • A measurement error in the dependent variable leads to less precise parameter estimates, whereas measurement errors in the explanatory variables additionally lead to biased estimates. This is the errors-in-variables bias, which typically is a downward bias, the so-called attenuation bias (see the simulation sketch below)

  • However, this result strongly depends on the specification in Equation 2.44. The crucial point is whether the measurement error is correlated with the observable variable. With the specification of Equation 2.44 it is, and this is called the classical errors-in-variables model

  • As an example suppose we want to measure ability in a wage equation by the IQ. In this case we have

    IQ = ability + e \tag{2.46}

    so that IQ is simply an inaccurate measure of ability with \operatorname{Corr}(IQ,\, e) > 0, leading to the bias described above
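The attenuation bias of the classical errors-in-variables model is easy to demonstrate with a small simulation of my own (not from the text). With \operatorname{Var}(x) = \operatorname{Var}(e) = 1, the OLS slope based on the mismeasured regressor converges to \beta_1 \cdot \operatorname{Var}(x)/(\operatorname{Var}(x)+\operatorname{Var}(e)) = 0.5 \, \beta_1:

Code
# classical errors-in-variables: regressor observed with measurement error
set.seed(1)
n  <- 10000
x  <- rnorm(n)                 # true regressor, Var(x) = 1
e  <- rnorm(n)                 # measurement error, Var(e) = 1
xO <- x + e                    # observed regressor, Equation 2.44
y  <- 1 + x + rnorm(n)         # true model with beta_1 = 1

coef(lm(y ~ x))[2]    # using the true x: approx. 1
coef(lm(y ~ xO))[2]   # using the mismeasured x: approx. 0.5 (attenuation bias)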

Proxy variables

However, there is another interpretation:

  • Think about ability as a competence which consists of several aspects: besides IQ, there is practical talent, diligence, readiness for effort, work ethic, etc., the latter all summarized in \epsilon

    ability = \delta_0 + \delta_1 IQ + \epsilon \tag{2.47}

  • Here, we implicitly assume that \operatorname{Corr}(IQ,\, \epsilon)=0 \, ! With this approach in mind, we can simply plug Equation 2.47 into a standard wage equation, using IQ as a so-called proxy variable for ability:

    wage = \beta_0 + \beta_1 educ+ \beta_2 exper + \beta_3 \underbrace{(\delta_0 + \delta_1 IQ +\epsilon)}_{ability} + v

    Collecting terms we get:

wage = (\beta_0 + \beta_3 \delta_0) + \beta_1 educ+ \beta_2 exper + \beta_3 \delta_1 IQ + (v + \beta_3\epsilon) \tag{2.48}

E(wage \mid educ, exper, IQ) = (\beta_0 + \beta_3 \delta_0) + \beta_1 educ+ \beta_2 exper + \beta_3 \delta_1 IQ +

E(v + \beta_3\epsilon \mid educ, exper, IQ)

  • To get unbiased (or at least consistent) estimates for \beta_1 and \beta_2, the following conditions must hold

    • E(v \mid educ, exper, IQ) = 0, which is the usual conditional mean independence assumption. However, this condition is now more likely to be met because ability is no longer a part of v

    • E(\epsilon \mid educ, exper, IQ) = 0

  • The latter is the more interesting one; it demands that the other factors determining ability, \epsilon, are likewise uncorrelated with the explanatory variables in the wage equation, especially with educ

  • It is doubtful whether this is true in this particular example. However, even if not, using the proxy IQ for ability might nevertheless be useful if the correlation of educ with \epsilon is smaller than the correlation of educ with the unobservable ability, thereby reducing the omitted variable bias

  • However, in any case, the researcher has to carefully argue that the conditions for the proxy variable approach are actually justifiable on theoretical grounds


Errors in variables – Example

  • We are estimating a wage equation; we are looking for the effect of education on log(wage) = lwage

  • Problem: Ability not observable and probably correlated with both educ and lwage ==> effect of education probably overestimated

  • Possible additional problem, maybe of even more importance: educ is measured in years which might be a bad indicator for “real” education ==> errors in variables problem ==> underestimation of education

library(wooldridge)
library(modelsummary)
data(wage2)

# Model without proxy for ability, but with a lot of control variables
out1 <- lm(lwage ~ educ+exper+tenure+hours+married+black+south+urban, data=wage2)

# Use of IQ as proxy variable
out2 <- lm(lwage ~ educ+exper+tenure+hours+married+black+south+urban+IQ, data=wage2)

# Use of IQ and KWW as proxy variables
out3 <- lm(lwage ~ educ+exper+tenure+hours+married+black+south+urban+IQ+KWW, data=wage2)

# Use of IQ as proxy variable; additional interaction with educ - slope effect 
out4 <- lm(lwage ~ educ+exper+tenure+hours+married+black+south+urban+IQ+I(educ*IQ), data=wage2)


modelsummary( list(out1, out2, out3, out4), 
              output="gt", 
              statistic = "statistic",
              coef_omit = "Inter",
              gof_omit = "A|B|L|F", 
              align = "ldddd", 
              stars = TRUE, 
              fmt = 4)
Table 2.4:

Wage equation with different proxy variables for ability (t-statistic in brackets)

(1) (2) (3) (4)
educ     0.0666***     0.0555***     0.0504***     0.0233   
  (10.6825)      (8.0438)      (6.9825)      (0.5701)  
exper     0.0139***     0.0140***     0.0124***     0.0138***
   (4.3862)      (4.4471)      (3.8671)      (4.3631)  
tenure     0.0113***     0.0109***     0.0104***     0.0110***
   (4.6222)      (4.5032)      (4.2647)      (4.5029)  
hours    -0.0053**     -0.0053**     -0.0056***    -0.0052** 
  (-3.1692)     (-3.1923)     (-3.3713)     (-3.1668)  
married     0.2041***     0.2044***     0.1961***     0.2054***
   (5.2478)      (5.2908)      (5.0661)      (5.3119)  
black    -0.1996***    -0.1543***    -0.1407***    -0.1575***
  (-5.3010)     (-3.9120)     (-3.5343)     (-3.9709)  
south    -0.0911***    -0.0803**     -0.0824**     -0.0804** 
  (-3.4869)     (-3.0754)     (-3.1617)     (-3.0770)  
urban     0.1854***     0.1834***     0.1766***     0.1848***
   (6.9082)      (6.8777)      (6.5983)      (6.9147)  
IQ     0.0036***     0.0031**     -0.0005   
   (3.6086)      (3.0428)     (-0.0915)  
KWW     0.0043*  
   (2.3326)  
I(educ * IQ)     0.0003   
   (0.7993)  
Num.Obs.   935          935          935          935       
R2     0.261         0.271         0.275         0.271    
RMSE     0.36          0.36          0.36          0.36     
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

2.9.4 The random coefficients model

Until now we always assumed that the coefficient vector \boldsymbol\beta = (\beta_1, \ldots , \beta_k) is constant across all observations

  • But this assumption is quite restrictive and sometimes even implausible; for instance in a wage equation that would mean that the payoff of education is equal for all individuals

  • We now want to investigate the consequences when some unobservable factors affect the coefficients \boldsymbol\beta. In particular we assume

\boldsymbol\beta_i = \boldsymbol\beta + \boldsymbol \epsilon_i \tag{2.49}

Here, \boldsymbol\beta_i, the coefficients for observation (unit or individual) i, is equal to \boldsymbol\beta, the population mean which we want to estimate, plus an idiosyncratic error \boldsymbol \epsilon_i, which captures some unobserved factors of unit i affecting the partial effects of the explanatory variables

  • For instance, in our wage equation example that would mean that the partial effect of education on wage depends on some unobservable (individual specific) factors like diligence or motivation

  • Remark: If the partial effects \boldsymbol\beta_i depend on observable factors, we could and should deal with them by interaction terms, see Section 5.2


This formulation leads to the random coefficients model with \mathbf x_i = (x_{1,i}, \ldots, x_{k,i})

y_i \ = \ \beta_0 + \mathbf x_i \boldsymbol\beta_i + u_i \ = \beta_0 + \ \mathbf x_i(\boldsymbol\beta + \boldsymbol\epsilon_i) + u_i \ \Rightarrow

y_i \ = \ \beta_0 + \mathbf x_i\boldsymbol\beta + (\mathbf x_i \boldsymbol \epsilon_i + u_i) \tag{2.50}

  • Hence, the new error term of the model is (\mathbf x_i \boldsymbol \epsilon_i + u_i) and includes \mathbf x_i which could put the unbiasedness and consistency of \hat {\boldsymbol\beta} at risk. Calculating the conditional expectation we get

E[y_i|\mathbf X] \ = \ \beta_0 + \mathbf x_i\boldsymbol\beta + \mathbf x_i E[\boldsymbol\epsilon_i|\mathbf X] + E[u_i|\mathbf X]

  • E[u_i|\mathbf X]=0 by assumption MLR.4 (exogeneity)

  • The second requirement for unbiasedness and consistency is E[\boldsymbol\epsilon_i|\mathbf X]=0

    • So, the crucial assumption is that the unobserved factors affecting \boldsymbol \beta_i are independent from \mathbf X

    • With reference to our example, unobserved factors like diligence or motivation should be independent from the included explanatory variables like education or experience

      • But these unobserved factors are frequently the same as those in u_i (the factors in u_i affect the intercept \beta_0 and those in \epsilon_i the slopes \boldsymbol\beta_i), and so the second requirement is often not an additional constraint
  • Although OLS remains unbiased and consistent under E[\boldsymbol\epsilon_i|\mathbf X]=0, the error term (\mathbf x_i \boldsymbol \epsilon_i + u_i) of the random coefficients model almost certainly exhibits heteroskedasticity with the conditional variance containing the term \mathbf x_i \! \operatorname {Var}(\boldsymbol \epsilon_i) \, \mathbf x_i'

    • Some authors even view random coefficients (slopes) as a main reason for heteroskedastic errors; the sketch below illustrates this point
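A minimal sketch of my own (not from the text): with E[\boldsymbol\epsilon_i \mid \mathbf X]=0 the OLS slope still centers on the population mean \beta, but the residuals become heteroskedastic, which a Breusch-Pagan test typically detects:

Code
# random coefficient model: beta_i = beta + eps_i, eps_i independent of x
library(lmtest)                      # for the Breusch-Pagan test bptest()
set.seed(1)
n   <- 1000
x   <- rnorm(n, mean = 2, sd = 1)
eps <- rnorm(n, sd = 0.5)            # unit-specific deviation of the slope
u   <- rnorm(n)
y   <- 1 + (1 + eps)*x + u           # beta_i = 1 + eps_i, population mean slope = 1

fit <- lm(y ~ x)
coef(fit)[2]       # close to the population mean slope of 1
bptest(fit)        # heteroskedasticity is usually detected (small p-value)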

2.9.5 Functional forms

With OLS, some non-linearities in the assumed relationship could be modeled by including non-linear transformations of the variables, such as logs, squares or higher order polynomial terms

  • Thereby, two difficulties arise

    1. The partial effects are not constant, as in a linear model, but depend on the values of the variables. So, in order to get an easily interpretable result, you have to calculate the APEs. In R this can be done with the function avg_slopes() from the marginaleffects package. With plot_predictions() you can plot the non-linear relationships at average values of the other variables (a minimal sketch follows after this list)

    2. It is often not clear whether a linear specification is sufficient or whether there are important non-linearities at work. Furthermore, there are infinitely many possibilities of non-linearities. So, we need a specification test
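As a minimal illustration of point 1 (assuming the marginaleffects package is installed; the exact output depends on the package version), one could compute average partial effects for a wage model with a quadratic experience term:

Code
# APEs for a non-linear specification, using avg_slopes() from marginaleffects
library(wooldridge); data("wage1")
library(marginaleffects)

mod <- lm(lwage ~ educ + exper + I(exper^2), data = wage1)

avg_slopes(mod)                              # average partial effects of educ and exper
plot_predictions(mod, condition = "exper")   # predicted lwage over exper, other regressors at typical values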


RESET

A general test for functional form misspecification is the so-called Regression Equation Specification Error Test, or simply RESET. The idea of RESET is to include squares and possibly higher-order powers of the fitted values of y in the regression

y = \beta_0 + \beta_1 x_1 + \ldots + \beta_k x_j + \textcolor {red}{\delta_1 \hat y^2 + \delta_2 \hat y^3} + u \tag{2.51}

  • As \hat{y} is a linear combination of all explanatory variables, terms like \hat y ^i are non-linear transformations of these. So, if the model is correctly specified, terms like \hat y ^i should have no impact on y

  • With RESET one simply checks whether the terms \hat y ^i are significant, i.e., we test with an F-test whether \delta_1, \delta_2, \ldots are all 0

  • This can be done

    • by hand (estimating the model, storing the predicted values \hat y, calculating \hat y ^2, \hat y ^3 and so on, estimating a second model including these terms and carrying out an F-test); see the sketch at the end of this subsection

    • or more conveniently in R with the function resettest()

  • In the following example we take our wage model and check whether a specification without a non-linear term is sufficient


library(AER)
library(wooldridge); data("wage1")

# estimating the linear wage model and storing the results in list myres0
myres0 <- lm(lwage ~ educ + exper, data = wage1)

# carrying out a RESET, the default is inclusion of 2nd and 3rd order 
# terms of predicted values of y
resettest(myres0)
      
        RESET test
      
      data:  myres0
      RESET = 6.6502, df1 = 2, df2 = 521, p-value = 0.001406
# So, the H0 is clearly rejected, meaning that the model suffers from 
# a general misspecification

# Estimating the model with non-linear exper and storing the results in list myres1
myres1 <- lm(lwage ~ educ + exper + expersq, data = wage1) 


# RESET for the full model 
resettest(myres1)
      
        RESET test
      
      data:  myres1
      RESET = 6.1111, df1 = 2, df2 = 520, p-value = 0.002381
summary(myres1)
      
      Call:
      lm(formula = lwage ~ educ + exper + expersq, data = wage1)
      
      Residuals:
           Min       1Q   Median       3Q      Max 
      -1.96387 -0.29375 -0.04009  0.29497  1.30216 
      
      Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
      (Intercept)  0.1279975  0.1059323   1.208    0.227    
      educ         0.0903658  0.0074680  12.100  < 2e-16 ***
      exper        0.0410089  0.0051965   7.892 1.77e-14 ***
      expersq     -0.0007136  0.0001158  -6.164 1.42e-09 ***
      ---
      Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
      
      Residual standard error: 0.4459 on 522 degrees of freedom
      Multiple R-squared:  0.3003,  Adjusted R-squared:  0.2963 
      F-statistic: 74.67 on 3 and 522 DF,  p-value: < 2.2e-16
  • Even with expersq, RESET points to a functional misspecification of this regression model. Maybe, additional nonlinear terms or interaction terms are necessary for a correct specification, see Section 5.2
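The “by hand” version of RESET mentioned above can be sketched as follows; the F statistic from anova() should reproduce the resettest() result for the first model above (up to rounding):

Code
# RESET by hand: augment the model with powers of the fitted values and F-test them
library(wooldridge); data("wage1")

m0 <- lm(lwage ~ educ + exper, data = wage1)
wage1$yhat <- fitted(m0)                   # store the predicted values

m1 <- lm(lwage ~ educ + exper + I(yhat^2) + I(yhat^3), data = wage1)

anova(m0, m1)   # F-test of delta_1 = delta_2 = 0, compare with resettest(myres0) above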

  1. To simplify, we suppress any intercepts and assume that the variables all have an expectation of zero.↩︎

  2. If you want, you can norm the coefficients of Q to one by dividing the first equation by 1+t_1 and the second one by 1+t_2. Clearly, this doesn’t alter the solution set of the equations.↩︎

  3. Otherwise, we would have to employ an IV-estimator. For a general discussion of these issues and basic identification problems, see Chapter 7.↩︎

  4. Without intercept, because the intercept is part of the other explanatory variables; actually, you can include the intercept, but \hat \beta_0 would be zero anyway.↩︎